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Hamiltonian dynamics can be used to produce distant proposals for the Metropolis 
algorithm, thereby avoiding the slow exploration of the state space that results from 
the diffusive behaviour of simple random-walk proposals. Though originating in 
physics, Hamiltonian dynamics can be applied to most problems with continuous 
state spaces by simply introducing fictitious "momentum" variables. A key to 
its usefulness is that Hamiltonian dynamics preserves volume, and its trajectories 
can thus be used to define complex mappings without the need to account for a 
hard-to-compute Jacobian factor — a property that can be exactly maintained 
even when the dynamics is approximated by discretizing time. In this review, I 
discuss theoretical and practical aspects of Hamiltonian Monte Carlo, and present 
some of its variations, including using windows of states for deciding on acceptance 
or rejection, computing trajectories using fast approximations, tempering during 
the course of a trajectory to handle isolated modes, and short-cut methods that 
prevent useless trajectories from taking much computation time. 



1 Introduction 



Marko v Chain Monte Carlo (MCMC) originated with the classic paper of [Metropolis et aL 
( 119531 ). where it was used to simulate the distribution of states for a system of ideal- 
i zed molecules. Not long afte r, another approach to molecular simulation was introduced 
( lAlder and Wainwrightl . 119591 ). in which the motion of the molecules was deterministic, fol- 
lowing Newton's laws of motion, which have an elegant formalization as Hamiltonian dy- 
namics. For finding the properties of bulk materials, these approaches are asymptotically 
equivalent, since even in a deterministic simulation, each local region of the material ex- 
periences effectively random influences from distant regions. Despite the large overlap in 
their application areas, the MCMC and molecular dynamics approaches have continued to 
co-exist in the following decades (see iFrenkel and Smitl . 119961 ) . 



In [1987|, a landmark paper by Duane, Kennedy, Pendleton, and Roweth united the MCMC 
and molecular dynamics approaches. They called their method "Hybrid Monte Carlo" , which 
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abbreviates to "HMC" , but the phrase "Hamiltonian Monte Carlo" , retaining the abbrevi- 
ation, is more specific and descriptive, and I will use it here. Duane, et al. applied HMC 
not to molecular simulation, but to lattice field theory simulations of quantum chromody- 
namics. Stat istical applications of HMC began with my use of it for neural network models 
( INeall . Il996af). I also p rovided a statistically-oriented tutorial on HMC in a review of MCMC 



( INeall . Il996af). I also p rovided a statistically-oriented tutorial on HMC in a review of MCMC 
methods ( INeall . 119931 Chapter 5). There have b een other applications of HMC to sta tisti- 
cal p roblems (eg. Ilshwaranj . 119991 ; iSchmidtl . 120091 ) and statistically-oriented reviews (eg, iLiul 
2001, Chapter 9), but HMC still seems to be under-appreciated by statisticians, and perhape 



200ll . Chapter 9), but HMC still seems to be under-appreciated by statisticians, and perhaps 
also by physicists outside the lattice field theory community. 

This review begins by describing Hamiltonian dynamics. Despite terminology that may be 
unfamiliar to non-physicists, the features of Hamiltonian dynamics that are needed for HMC 
are elementary. The differential equations of Hamiltonian dynamics must be discretized for 
computer implementation. The "leapfrog" scheme that is typically used is quite simple. 

Following this introduction to Hamiltonian dynamics, I describe how to use it to construct 
a Markov chain Monte Carlo method. The first step is to define a Hamiltonian function in 
terms of the probability distribution we wish to sample from. In addition to the variables 
we are interested in (the "position" variables), we must introduce auxiliary "momentum" 
variables, which typically have independent Gaussian distributions. The Hamiltonian Monte 
Carlo method alternates simple updates for these momentum variables with Metropolis up- 
dates in which a new state is proposed by computing a trajectory according to Hamiltonian 
dynamics, implemented with the leapfrog method. A state proposed in this way can be 
distant from the current state but nevertheless have a high probability of acceptance. This 
bypasses the slow exploration of the state space that occurs when Metropolis updates are 
done using a simple random-walk proposal distribution. (An alternative way of avoiding 
random walks is to use short trajectories but only partially replace the momentum variables 
between trajectories, so that successive trajectories tend to move in the same direction.) 

After presenting the basic HMC method, I discuss practical issues of tuning the leapfrog 
stepsize and number of leapfrog steps, as well as theoretical results on the scaling of HMC 
with dimensionality. I then present a number of variations on HMC. The acceptance rate for 
HMC can be increased for many problems by looking at "windows" of states at the begin- 
ning and end of the trajectory. For many statistical problems, approximate computation of 
trajectories (eg, using subsets of the data) may be beneficial. Tuning of HMC can be made 
easier using a "short-cut" in which trajectories computed with a bad choice of stepsize take 
little computation time. Finally, "tempering" methods may be useful when multiple isolated 
modes exist. 



2 Hamiltonian dynamics 

Hamiltonian dynamics has a physical interpretation that can provide useful intuitions. In 
two dimensions, we can visualize the dynamics as that of a frictionless puck that slides over a 
surface of varying height. The state of this system consists of the position of the puck, given 
by a 2D vector q, and the momentum of the puck (its mass times its velocity), given by a 2D 
vector p. The potential energy, U(q), of the puck is proportional to the height of the surface 
at its current position, and its kinetic energy, K(p), is equal to \p\ 2 /(2m), where m is the 
mass of the puck. On a level part of the surface, the puck moves at a constant velocity, equal 
to p/m. If it encounters a rising slope, the puck's momentum allows it to continue, with its 
kinetic energy decreasing and its potential energy increasing, until the kinetic energy (and 
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hence p) is zero, at which point it will slide back down (with kinetic energy increasing and 
potential energy decreasing). 

In non-physical MCMC applications of Hamiltonian dynamics, the position will corre- 
spond to the variables of interest. The potential energy will be minus the log of the proba- 
bility density for these variables. Momentum variables, one for each position variable, will 
be introduced artificially. 

These interpretations may help motivate the exposition below, but if you find otherwise, 
the dynamics can also be understood as simply resulting from a certain set of differential 
equations. 



2.1 Hamilton's equations 



Hamiltonian dynamics operates on a (/-dimensional position vector, q, and a (/-dimensional 
momentum vector, p, so that the full state space has 2d dimensions. The system is described 
by a function of q and p known as the Hamiltonian, H(q,p). 

Equations of motion. The partial derivatives of the Hamiltonian determine how q and 
p change over time, t, according to Hamilton's equations: 

<ki = dH 

dt d Pi 1 ' ' 

dpi dH 

— = (2.2) 

dt % 1 J 

for % — 1, . . . , d. For any time interval of duration s, these equations define a mapping, T s , 
from the state at any time t to the state at time t + s. (Here, H, and hence T s , are assumed 
to not depend on t.) 

Alternatively, we can combine the vectors q and p into the vector z = (q,p) with 2d 
dimensions, and write Hamilton's equations as 

% = JVH(z) (2.3) 
where VH is the gradient of H (ie, [VH] k = dH/dz k ), and 

^dxd Idxd 1 . ,. 

J = (2.4) 

. ^Idxd Odxd . 

is a 2d x 2d matrix whose quadrants are defined above in terms identity and zero matrices. 

Potential and kinetic energy. For Hamiltonian Monte Carlo, we usually use Hamilto- 
nian functions that can be written as follows: 

H(q,p) = U(q) + K(p) (2.5) 

Here, U(q) is called the potential energy, and will be defined to be minus the log probability 
density of the distribution for q that we wish to sample, plus any constant that is convenient. 
K{p) is called the kinetic energy, and is usually defined as 

K{p) = p T M~ 1 p/2 (2.6) 
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Here, M is a symmetric, positive-definite "mass matrix", which is typically diagonal, and 
is often a scalar multiple of the identity matrix. This form for K(p) corresponds to minus 
the log probability density (plus a constant) of the zero-mean Gaussian distribution with 
covariance matrix M. 

With these forms for H and K, Hamilton's equations, ( 12. ip and ( 12. 2p . can be written as 
follows, for i — 1, . . . , d: 

*f t = [M-'p], (2.7) 

<k± = -^L (2 8) 

dt dqi 

A one-dimensional example. Consider a simple example in one dimension (for which 
q and p are scalars and will be written without subscripts), in which the Hamiltonian is 
defined as follows: 

H(q,p) = U(q) + K(p), U(q) = q 2 /2, K{p) = p 2 /2 (2.9) 

As we'll see later in Section 13.14 this corresponds to a Gaussian distribution for q with 
mean zero and varia nce one. The dynamics resulting from this Hamiltonian (following equa- 
tions d23D and flUD) is 

dq d P /ami 

Tt =p > Tt=~ q > (2 - 10) 

Solutions have the following form, for some constants r and a: 

q(t) = rcos(a + t), p(t) = — r sm(a + 1) (2-H) 

Hence the mapping T s is a rotation by s radians clockwise around the origin in the (q,p) 
plane. In higher dimensions, Hamiltonian dynamics generally does not have such a simple 
periodic form, but this example does illustrate some important properties that we will look 
at next. 



2.2 Properties of Hamiltonian dynamics 

Several properties of Hamiltonian dynamics are crucial to its use in constructing Markov 
chain Monte Carlo updates. 

Reversibility. First, Hamiltonian dynamics is reversible — the mapping T s from the state 
at time t, (q(t),p(t)), to the state at time t+s, (q(t + s),p(t + s)), is one-to-one, and hence has 
an inverse, T _ s . T his i nvers e mapping is obtained by simply negating the time deriv atives 
in equations ( 12. ip and ( 12. 2ft . When the Hamiltonian has the form in equ atio n ( 12.51) . and 
K{p) = K(—p), as in the quadratic form for the kinetic energy of equation (12.6p . the inverse 
mapping can also be obtained by negating p, applying T s , and then negating p again. 

In the simple ID example of equation (I2.9p . T_ s is just a counter-clockwise rotation by s 
radians, undoing the clockwise rotation of T s . 

The reversibility of Hamiltonian dynamics is important for showing that MCMC updates 
that use the dynamics leave the desired distribution invariant, since this is most easily proved 
by showing reversibility of the Markov chain transitions, which requires reversibility of the 
dynamics used to propose a state. 
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Conservation of the Hamiltonian. A second property of the dynamics i s th at it keeps 
the Hamiltonian invariant (ie, conserved). This is easily seen from equations (12.1 ft and (12. 2h 
as follows: 



alH 
~dt 



d r 



E 



dqi dH dpi dH 
dt dqi dt dpi 



E 



dH dH dH dH 



dpi dq { dqi dp { 



(2-12) 



With the Hamiltonian of equation (12. 9p . the value of th e Hamiltonian is half the squared 
distance from the origin, and the solutions (equation (12. lip ) stay at a constant distance from 
the origin, keeping H constant. 



For Metropolis updates using a proposal found by Hamiltonian dynamics, which form 
part of the HMC method, the acceptance probability is one if H is kept invariant. We will 
see later, however, that in practice we can only make H approximately invariant, and hence 
we will not quite be able to achieve this. 

Volume preservation. A third fundamental property of Hamiltonian dynamics is that it 
preserves volume in (q,p) space (a result known as Liouville's Theorem). If we apply the 
mapping T s to the points in some region R of (q, p) space, with volume V, the image of R 
under T s will also have volume V. 

With the Hamiltonian of equation (12. 9p . the solutions (equation (12. lip ) are rotations, 
which obviously do not change the volume. Such rotations also do not change the shape of a 
region, but this is not so in general — Hamiltonian dynamics might stretch a region in one 
direction, as long as the region is squashed in some other direction so as to preserve volume. 

The significance of volume preservation for MCMC is that we needn't account for any 
change in volume in the acceptance probability for Metropolis updates. If we proposed new 
states using some arbitrary, non-Hamiltonian, dynamics, we would need to compute the 
determinant of the Jacobian matrix for the mapping the dynamics defines, which might well 
be infeasible. 



The preservation of volume by Hamiltonian dynamics can be prove d in sever al wa ys. One 
is to note that the divergence of the vector field defined by equations (12. ip and (12. 2 p is zero, 
which can be seen as follows: 



E 

i=l 



d dqi d dpi 
dqi dt dpi dt 



d r 



E 

2=1 



d dH d dH 



dqi dpi dpi dqi 



E 

i=l 



d 2 H 



d 2 H 



dqidpi dpidqi 



(2.13) 



A vector field with zero divergence can be shown to preserve volume (lArnoldl . 119891 ) . 



Here, I will show informally that Hamiltonian dynamics preserves volume more directly, 
without presuming this property of the divergence. I will, however, take as given that volume 
preservation is equivalent to the determinant of the Jacobian matrix of T s having absolute 
value one, which is related to the well-known role of this determinant in regard to the effect 
of transformations on definite integrals and on probability density functions. 

The 2d x 2d Jacobian matrix of T s , seen as a mapping of z = {q,p), will be written as 
B s . In general, B s will depend on the values of q and p before the mapping. When B s is 
diagonal, it is easy to see that the absolute values of its diagonal elements are the factors by 
which T s stretches or compresses a region in each dimension, so that the product of these 
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factors, which is equal to the absolute value of det(B s ), is the factor by which the volume of 
the region changes. I will not prove the general result here, but note that if we were to (say) 
rotate the coordinate system used, B s would no longer be diagonal, but the determinant 
of B s is invariant to such transformations, and so would still give the factor by which the 
volume changes. 

Let's first consider volume preservation for Hamiltonian dynamics in one dimension (ie, 
with d = 1), for which we can drop the subscripts on p and q. We can approximate T$ for S 
near zero as follows: 



Ts(q,p) 



Q 
P 



+ 5 



dq/dt 
dp/dt 



+ terms of order 5 2 or higher 



(2.14) 



Taking the time derivatives from equations (12. ip and (12.21) . the Jacobian matrix can be 
written as 



Bs 



1 + 5 



-5 



d 2 H 
dqdp 

d 2 H 
dq 2 



5 



d 2 H 



1 - 5 



dp 2 
d 2 H 



dpdq . 



+ terms of order 5 2 or higher 



(2.15) 



We can then write the determinant of this matrix as 



det(B s ) 



5 



d 2 H 



- 5 



d 2 H 



dqdp dpdq 
1 + terms of order 5 2 or higher 



+ terms of order 5 2 or higher 



(2.16) 
(2.17) 

Since log(l + x) ~ x for x near zero, log det (.£?,$) is zero except perhaps for terms of order 
5 2 or higher (though we will see later that it is exactly zero). Now consider logdet(_B s ) for 
some time interval s that is not close to zero. Setting 5 = s/n, for some integer n, we can 
write T s as the composition of T$ applied n times (from n points along the trajectory), so 
det(-B s ) is the n-fold product of det{Bs) evaluated at these points. We then find that 



logdet(5 s ) = ^logdet(£ 6 



(2.18) 



n 

| terms of order 1/n 2 or smaller | 



(2.19) 



terms of order 1/n or smaller 



(2.20) 



Note that the value of Bs in the sum in (12.181) might perhaps vary with i, since the values of 
q and p vary along the trajectory that produces T s . However, assuming that trajectories are 
not singular, the variation in B$ must be bounded along any particular trajectory. Taking 
the limit as n — > oo, we conclude that logdet(5 s ) = 0, so det(B s ) = 1, and hence T s preserves 
volume. 



When d > 1, the same argum ent applies. The Jacobian matrix will now have the following 
form (compare equation (12.151) ). where each entry shown below is a d x d sub-matrix, with 
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rows indexed by i and columns by j 

d 2 H 



B* 



1 + 5 



-5 



dqjdp 

d 2 H 
dqjdqi 



5 



d 2 H 

dpjdpi 



1-5 



3 2 H 
dp j dqt 



+ terms of order 5 2 or higher (2.21) 



As for d = 1, the determinant of this matrix will be one plus terms of order 5 2 or higher, 
since all the terms of order 5 cancel. The remainder of the argument above then applies 
without change. 

Symplecticness. Volume preservation is also a consequence of H ami ltonian dynamics be- 
ing symplectic. Letting z = (q,p), and defining J as in equation fl2.4p . the symplecticness 
condition is that the Jacobian matrix, B s , of the mapping T s satisfies 



B T s J~ l B a = J- 1 



(2.22) 



This implies volume conservation, since det( J BJ)det(J- 1 )det(S s ) = det(J _1 ) implies that 
det(-B s ) 2 is one. When d > 1, the symplecticness condition is stronger than volume preser- 
vation. Hamiltonian dynamics and the symplecticness condition can be generalized to where 
J is any matrix for which J T = — J and det(J) ^ 0. 

Crucially, reversibility, preservation of volume, and symplecticness can be maintained 
exactly even when, as is necessary in practice, Hamiltonian dynamics is approximated, as 
we will see next. 



2.3 Discretizing Hamilton's equations — the leapfrog method 

For computer implementation, Hamilton's equations must be approximated by discretizing 
time, using some small stepsize, e. Starting with the state at time zero, we iteratively 
compute (approximately) the state at times e, 2e, 3e, etc. 

In discussing how to do this, I will assume that the Hamiltonian has the form H(q,p) = 
U(q)+K(p), as in equation (12.51) . Although the methods below can be applied with any form 
for the kinetic energy, I for simplicity assume that K(p) = p T M~ 1 p, as in equation ( 12.61) . 
and furthermore that M is diagonal, with diagonal elements mi, . . . , m^, so that 

Ktp) = ±lL (2.23) 

i=l 

Euler's method. Perhaps the best-known way to approximate the solution to a system 
of differential equations is Euler's method. For Hamilton's equations, this method performs 
the following steps, for each component of position and momentum, indexed by i = 1, . . . , d: 

Pi{t + e) = + e^(t) = Pi (t) - e^-(q(t)) (2.24) 

q .(t + £ ) = q .(t) + A t ) = q .(t) + £ Pl^l (2.25) 

dt rrii 
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The time derivatives above are from the form of Hamilton's equations given by (12.71) and (12. 8p . 
If we start at t — with given values for qi(0) and Pi(0), we can iterate the steps above to 
get a trajectory of position and momentum values at times e, 2e, 3e, . . ., and hence find 
(approximate) values for q{r) and p(r) after r/e steps (assuming r/e is an integer). 

Figure Ufa) shows th e res ult of using Euler's method to approximate the dynamics defined 
by the Hamiltonian of (12. 9p . starting from q(0) = and p(0) = 1, and using a stepsize of 
e = 0.3 for 20 steps (ie, to r = 0.3 x 20 = 6). The results aren't good — Euler's method 
produces a trajectory that diverges to infinity, but the true trajectory is a circle. Using a 
smaller value of e, and correspondingly more steps, produces a more accurate result at r = 6, 
but although the divergence to infinity is slower, it is not eliminated. 

A modification of Euler's method. Much better results can be obtained by slightly 
modifying Euler's method, as follows: 

dU 

Pi(t + e) = Pi (t) - e^(g(t)) (2-26) 

q t (t + e) = «(t) + e Piit + £) (2.27) 

We simply use the new value for the momentum variables, pi, when computing the new 
value for the position variables, g«. A method with similar performance can be obtained by 
instead updating the first and using their new values to update the Pi. 

Figure DJb) shows the results using this modification of Euler's method with e = 0.3. 
Though not perfect, the trajectory it produces is much closer to the true trajectory than 
that obtained using Euler's method, with no tendency to diverge to infinity. This better 
performance is related to the modified method's exact preservation of volume, which helps 
avoid divergence to infinity or spiraling into the origin, since these would typically involve 
the volume expanding to infinity or contracting to zero. 

To see that this modification of Euler's method preserves volume exactly despite the finite 
discretization of ti me, note that both the transformation from (q(t),p(t)) to (q(t),p(t + e)) 
via e quatio n (12.261) and the transformation from (q(t) , p(t + e)) to (q(t + e),p(t + e)) via equa- 
tion (I2.27P are "shear" transformations, in which only some of the variables change (either 
the pi or the g^), by amounts that depend only on the variables that do not change. Any 
shear transformation will preserve volume, since its Jacobian matrix will have determinant 
one (as the only non-zero term in the determinant will be the product of diagonal elements, 
which will all be one). 

The leapfrog method. Even better results can be obtained with the leapfrog method, 
which works as follows: 

Pi(t + e/2) = Pi {t) - ( £ /2)^-(q(t)) (2.28) 

qi (t + e) = gi (t) + £ P^±lI3. (2.29) 

rrii 

dU 

Pi{t + e) = Pi{t + e 2) - (e 2)— (q(t + e)) (2.30) 

oqi 
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(a) Euler's Method, stepsize 0.3 (b) Modified Euler's Method, stepsize 0.3 




-2-1012 -2-101 
position (q) position (q) 



(c) Leapfrog Method, stepsize 0.3 (d) Leapfrog Method, stepsize 1 .2 




-2-1012 -2-101 
position (q) position (q) 



Figure 1: Results using three methods for approximating Hamiltonian dynamics, when 
H(q,p) = q 2 /2 + p 2 /2. The initial state was q — 0, p — 1. The stepsize was e = 0.3 
for (a), (b), and (c), and e = 1.2 for (d). Twenty steps of the simulated trajectory are shown 
for each method, along with the true trajectory (in gray). 
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We start with a half step for the momentum variables, then do a full step for the position 
variables, using the new values of the momentum variables, and finally do another half step 
for the momentum variables, using the new values for the position variables. An analogous 
scheme can be used with any kinetic energy function, with dK/dpi replacing Pi/rrii above. 

When we apply equations f!2.28j) to (12.30P a second time to go from time t + e to t + 2s, 
we can combine the last half step of the first update, from pi(t + e/2) to Pi(t + e), with the 
first half step of the second update, from pi(t + e) to pi(t + e + e/2). The leapf rog metho d 
then looks very similar to the modification of Euler's method in equations (12.271) and ( 12. 26ft . 
except that leapfrog performs half steps for momentum at the very beginning and very end 
of the trajectory, and the time labels of the momentum values computed are shifted by e/2. 

The leapfrog method preserves volume exactly, since each of (12. 28ft to fl 2 . 3 f) are shear 
transformations. Due to its symmetry, it is also reversible by simply negating p, applying 
the same number of steps again, and then negating p again. 

Figure [TJ^c) shows the results using the leapfrog method with a stepsize of e = 0.3, which 
are indistinguishable from the true trajectory, at the scale of this plot. In Figure [T^d), the 
results of using the leapfrog method with e = 1.2 are shown (still with 20 steps, so almost four 
cycles are seen, rather than almost one). With this larger stepsize, the approximation error 
is clearly visible, but the trajectory still remains stable (and will stay stable indefinitely). 
Only when the stepsize approaches e = 2 do the trajectories become unstable. 



Local and global error of discretization methods. I will briefly discuss how the error 
from discretizing the dynamics behaves in the limit as the stepsize, e, goes to zero; Leimkuhler 
and Reich (2004) provide a much more detailed discussion. For useful methods, the error 
goes to zero as e goes to zero, so that any upper limit on the error will apply (apart from 
a usually unknown constant factor) to any different iable function of state — eg, if the error 
for (q,p) is no more than order e 2 , the error for H(q,p) will also be no more than order e 2 . 

The local error is the error after one step, that moves from time t to time t + e. The 
global error is the error after simulating for some fixed time interval, s, which will require s/e 
steps. If the local error is order e p , the global error will be order e 10 " 1 — the local errors of 
order e p accumulate over the s/e steps to give an error of order e v ~ x . If we instead fix e and 
consider increasing the time, s, for which the trajectory is simulated, the error can in general 
increase exponentially with s. Interestingly, however, this is often not what happens when 
simulating Hamiltonian dynamics with a symplectic method, as can be seen in Figure [TJ 

The Euler method and its modification above have order e 2 local error and order e global 
error. The leapfrog method has order e 3 local error and order e 2 global error. As shown by 



Leimkuhler and Reichl (12004L Section 4.3.3) this difference is a consequence of leapfrog being 



reversible, since any reversible method must have global error that is of even order in e. 



3 MCMC from Hamiltonian dynamics 

Using Hamiltonian dynamics to sample from a distribution requires translating the density 
function for this distribution to a potential energy function and introducing "momentum" 
variables to go with the original variables of interest (now seen as "position" variables). We 
can then simulate a Markov chain in which each iteration resamples the momentum and 
then does a Metropolis update with a proposal found using Hamiltonian dynamics. 
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3.1 Probability and the Hamiltonian — canonical distributions 

The distribution we wish to sample can be related to a potential energy function via the 
concept of a canonical distribution from statistical mechanics. Given some energy function, 
E(x), for the state, x, of some physical system, the canonical distribution over states has 
probability or probability density function 

P{x) = ^ exp(-£(x)/T) (3.1) 

Here, T is the temperature of the systemQ, and Z is the normalizing constant needed for this 
function to sum or integrate to one. Viewing this the opposite way, if we are interested in 
some distribution with density function P(x), we can obtain it as a canonical distribution 
with T = 1 by setting E(x) = — log P(x) — log Z, where Z is any convenient positive constant. 

The Hamiltonian is an energy function for the joint state of "position" , q, and "momen- 
tum" , p, and so defines a joint distribution for them, as follows: 

P{q,p) = iexp(-H(q,p)/T) (3.2) 

Note that the invariance of H under Hamiltonian dynamics means that a Hamiltonian trajec- 
tory will (if simulated exactly) move within a hyper-surface of constant probability density. 

If H(q,p) — U(q) + K(p), the joint density is 

P{q,p) = i exp(-U(q)/T) exp(-K(p)/T) (3.3) 

and we see that q and p are independent, and each have canonical distributions, with energy 
functions U(q) and K(p). We will use q to represent the variables of interest, and introduce 
p just to allow Hamiltonian dynamics to operate. 

In Bayesian statistics, the posterior distribution for the model parameters is the usual 
focus of interest, and hence these parameters will take the role of the position, q. We can 
express the posterior distribution as a canonical distribution (with T = 1) using a potential 
energy function defined as follows: 

U(q) = -\og[n(q)L(q\Dj\ (3.4) 
where ir(q) is the prior density, and L(q\D) is the likelihood function given data D. 



3.2 The Hamiltonian Monte Carlo algorithm 

We now have the background needed to present the Hamiltonian Monte Carlo (HMC) algo- 
rithm. HMC can be used to sample only from continuous distributions on R d for which the 
density function can be evaluated (perhaps up to an unknown normalizing constant). For 
the moment, I will also assume that the density is non-zero everywhere (but this is relaxed 
in Section loTTj) . We must also be able to compute the partial derivatives of the log of the 
density function. These derivatives must therefore exist, except perhaps on a set of points 
with probability zero, for which some arbitrary value could be returned. 



Note to physicists: I assume here that temperature is measured in units that make Boltzmann's constant be one. 
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HMC samples from the canonical distribution for q and p defined by equation (I3.3p . in 
which q has the distribution of interest, as specified using the potential energy function 
[/(g). We can choose the distribution of the momentum variables, p, which are independent 
of q, as we wish, specifying the distribution via the kinetic energy funct ion, K(p). Current 
practice with HMC is to use a quadratic kinetic energy, as in equation (12. 6p . which leads p 
to have a zero-mean multivariate Gaussian distribution. Most often, the components of p 
are specified to be independent, with component i having variance m r The kinetic energy 
function producing this distribution (setting T = 1) is 

m = (3.5) 

i=i 1 

We will see in Section H] how the choice for the affects performance. 

The two steps of the HMC algorithm. Each iteration of the HMC algorithm has two 
steps. The first changes only the momentum; the second may change both position and 
momentum. Both steps leave the canonical joint distribution of (q,p) invariant, and hence 
their combination also leaves this distribution invariant. 



In the first step, new values for the momentum variables are randomly drawn from their 
Gaussian distribution, independently of the current values of the position variables. For 
the kinetic energy of equation (13. 5p . the d momentum variables are independent, with pi 
having mean zero and variance m ; . Since q isn't changed, and p is drawn from it's correct 
conditional distribution given q (the same as its marginal distribution, due to independence), 
this step obviously leaves the canonical joint distribution invariant. 

In the second step, a Metropolis update is performed, using Hamiltonian dynamics to pro- 
pose a new state. Starting with the current state, (q,p), Hamiltonian dynamics is simulated 
for L steps using the Leapfrog method (or some other reversible method that preserves vol- 
ume), with a stepsize of e. Here, L and e are parameters of the a lgor ithm, which need to be 
tuned to obtain good performance (as discussed below in Section l4~2l) . The momentum vari- 
ables at the end of this L-step trajectory are then negated, giving a proposed state (q*,p*). 
This proposed state is accepted as the next state of the Markov chain with probability 



mm 



l,exp(-H(q*,p*)+H(q,p)) = min 1, exp(-[/(g*) + 17(g) - K{p*) + K{p)) (3.6) 



If the proposed state is not accepted (ie, it is rejected), the next state is the same as the 
current state (and is counted again when estimating the expectation of some function of state 
by its average over states of the Markov chain). The negation of the momentum variables 
at the end of the trajectory makes the Metropolis proposal symmetrical, as needed for the 
acceptance probability above to be valid. This negation need not be done in practice, since 
K(p) = K(—p), and the momentum will be replaced before it is used again, in the first step 
of the next iteration. (This assumes that these HMC updates are the only ones performed.) 



If we look at HMC as sampling from the joint distribution of q and p, the Metropolis step 
using a proposal found by Hamiltonian dynamics leaves the probability density for (q,p) 
unchanged or almost unchanged. Movement to (q,p) points with a different probability 
density is accomplished only by the first step in an HMC iteration, in which p is replaced by 
a new value. Fortunately, this replacement of p can change the probability density for (q,p) 
by a large amount, so movement to points with a different probability density is not a problem 
(at least not for this reason). Looked at in terms of q only, Hamiltonian dynamics for (q,p) 
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can produce a value for q with a much different probability density (equivalently, a much 
different potential energy, U(q)). However, the resampling of the momentum variables is still 
crucial to obtaining the proper distribution for q. Without resampling, H(q,p) = U(q)+K(p) 
will be (nearly) constant, and since K(p) and U(q) are non-negative, U(q) could never exceed 
the initial value of H(q,p) if no resampling for p were done. 

A function that implements a single iteration of the HMC algorithm, written in the R 
languag^l is shown in Figure EJ Its first two arguments are functions — U, which returns 
the potential energy given a value for q, and gracLU, which returns the vector of partial 
derivatives of U given q. Other arguments are the stepsize, epsilon, for leapfrog steps, the 
number of leapfrog steps in the trajectory, L, and the current position, current_q, that the 
trajectory starts from. Momentum variables are sampled within this function, and discarded 
at the end, with only the next position being returned. The kinetic energy is assumed to have 
the simplest form, K(p) = ^2pj/2 (ie, all are one). In this program, all components of p 
and of q are updated simultaneously, using vector operations. This simple implementation 
of HMC is available from my web pag43 along with other R programs with extra features 
helpful for practical use, and that illustrate some of the variants of HMC in Section [5j 

Proof that HMC leaves the canonical distribution invariant. The Metropolis up- 
date above is reversible with respect to the canonical distribution for q and p (with T = 1), a 
condition also known as "detailed balance" , and which can be phrased informally as follows. 
Suppose we partition the (q,p) space into regions Ak, each with the same small volume V. 
Let the image of Ak with respect to the operation of L leapfrog steps, plus a negation of the 
momentum, be Bk. Due to the reversibility of the leapfrog steps, the Bk will also partition 
the space, and since the leapfrog steps preserve volume (as does negation), each Bk will also 
have volume V. Detailed balance holds if, for all i and j, 

P{A,)T{B 3 \A t ) = P(Bj)T(A i \B j ) (3.7) 

where P is probability under the canonical distribution, and T(X\Y) is the conditional 
probability of proposing and then accepting a move to region X if the curre nt st ate is in 
region Y. Clearly, when i ^ j, T(Ai\Bj) = T(Bj\Ai) = and so equation f !3.Tj) will be 
satisfied. Since the Hamiltonian is continuous almost everywhere, in the limit as the regions 
Ak and Bk become smaller, the Hamiltonian becomes effectively constant within each region, 
with value Hx in region X, and hence the canonical probability density and the transition 
probabilit ies b ecome effectively constant within each region as well. We can now rewrite 
equation (I3.7P for i = j (say both equal to k) as 



V 



r l v r i 

exp(-F^J min 1, exp(-H Bk +H Ak )^ = — exp(-if Bfe ) min|l, exp(-H Ak +H Bk )^ (3.1 



Z 

which is easily seen to be true. 

Detailed balance implies that this Metropolis update leaves the canonical distribution 
for q and p invariant. This can be seen as follows. Let R(X) be the probability that the 
Metropolis update for a state in the small region X leads to rejection of the proposed state. 
Suppose that the current state is distributed according to the canonical distribution. The 
probability that the next state is in a small region Bk is the sum of the probability that the 
current state is in B k and the update leads to rejection, and the probability that the current 

2 R is available for free from r-project . org 
3 At www.cs.utoronto.ca/~radford 
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HMC = function (U, grad_U, epsilon, L, current_q) 
{ 

q = current_q 

p = rnorm (length (q) ,0, 1) # independent standard normal variates 
current _p = p 

# Make a half step for momentum at the beginning 
p = p - epsilon * grad_U(q) / 2 

# Alternate full steps for position and momentum 

for (i in 1:L) 
{ 

# Make a full step for the position 
q = q + epsilon * p 

# Make a full step for the momentum, except at end of trajectory 
if (i!=L) p = p - epsilon * grad_U(q) 

} 

# Make a half step for momentum at the end. 
p = p - epsilon * grad_U(q) / 2 

# Negate momentum at end of trajectory to make the proposal symmetric 
p = -p 

# Evaluate potential and kinetic energies at start and end of trajectory 

current_U = U(current_q) 
current_K = sum(current_p~2) / 2 
proposed_U = U(q) 
proposed_K = sum(p~2) / 2 

# Accept or reject the state at end of trajectory, returning either 

# the position at the end of the trajectory or the initial position 

if (runif(l) < exp(current_U-proposed_U+current_K-proposed_K) ) 
{ 

return (q) # accept 

} 

else 
{ 

return (current_q) # reject 

} 



Figure 2: The Hamiltonian Monte Carlo algorithm 
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state is in some region from which a move to B k is proposed and accepted. The probability 
of the next state being in B k can therefore be written as 

P(B k )R(B k ) + ^P(A)T(£,iA) = P{B k )R{B k ) + ^P(B fc )T(^|B fc ) (3.9) 

i i 

= P{B k )R(B k ) + P(B k )^2T(Ai\B k ) (3.10) 

i 

= P(B k )R(B k ) + P(B k )(l - R(B k )) (3.11) 

= P(B k ) (3.12) 

The Metropolis update within HMC therefore leaves the canonical distribution invariant. 

Since both the sampling of momentum variables and the Metropolis update with a pro- 
posal found by Hamiltonian dynamics leave the canonical distribution invariant, the HMC 
algorithm whole does as well. 

Ergodicity of HMC. Typically, the HMC algorithm will also be "ergodic" — it will not 
be trapped in some subset of the state space, and hence will asymptotically converge to 
its (unique) invariant distribution. In an HMC iteration, any value can be sampled for the 
momentum variables, which can typically then affect the position variables in arbitrary ways. 
However, ergodicity can fail if the L leapfrog steps in a trajectory produce an exact perio dicit y 
for some function of state. For exampl e, wi th the simple Hamiltonian of equation (12. 9ft . 
the exact solutions (given by equation ( 12.111) ) are periodic with period 2tt. Approximate 
trajectories found with L leapfrog steps with stepsize e may return to the same position 
coordinate when Le is approximately 2ir. HMC with such values for L and e will not be 
ergodic. For nearby values of L and e, HMC may be theoretically ergodic, but take a very 
long time to move about the full state space. 

This potential problem of non-ergodici ty can be s olved by randomly choosing e or L 
(or both) from some fairly small interval ( Mackenzie] . 119891 ). Doing this routinely may be 



advisable. Although in real problems interactions between variables typically prevent any 
exact periodicities from occurring, near periodicities might still slow HMC considerably. 



3.3 Illustrations of HMC and its benefits 

Here, I will illustrate some practical issues with HMC, and demonstrate its potential to 
sample much more efficiently than simple methods such as random-walk Metropolis. I use 
simple Gaussian distributions for these demonstrations, so that the results can be compared 
with known values, but of course HMC is typically used for more complex distributions. 

Trajectories for a two-dimensional problem. Consider sampling from a distribution 
for two variables that is bivariate Gaussian, with means of zero, standard deviations of 
one, and correlation 0.95. We regard these as "position" variables, and introduce two cor- 
responding "momentum" variables, defined to have a Gaussian distribution with means of 
zero, standard deviations of one, and zero correlation. We then define the Hamiltonian as 

H(q,p) = q T YT l q/2 + p T p / 2, with £ = 



1 0.95 
0.95 1 



(3.13) 
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Position coordinates Momentum coordinates Value of Hamiltonian 




-2 -1 1 2 -2 -1 1 2 5 10 15 20 25 



Figure 3: A trajectory for a 2D Gaussian distribution, simulated using 25 leapfrog steps 
with a stepsize of 0.25. The ellipses plotted are one standard deviation from the means. The 
initial state had q = [—1.50, — 1.55] T and p = [—1, 1] T . 

Figure |3] shows a trajectory based on this Hamiltonian, such as might be used to propose 
a new state in the Hamiltonian Monte Carlo method, computed using L = 25 leapfrog steps, 
with a stepsize of e — 0.25. Since the full state space is four-dimensional, the Figure shows 
the two position coordinates and the two momentum coordinates in separate plots, while 
the third plot shows the value of the Hamiltonian after each leapfrog step. 

Notice that this trajectory does not resemble a random-walk. Instead, starting from the 
lower- left corner, the position variables systematically move upwards and to the right, until 
they reach the upper-right corner, at which point the direction of motion is reversed. The 
consistency of this motion results from the role of the momentum variables. The projection 
of p in the diagonal direction will change only slowly, since the gradient in that direction is 
small, and hence the direction of diagonal motion stays the same for many leapfrog steps. 
While this large-scale diagonal motion is happening, smaller-scale oscillations occur, moving 
back and forth across the "valley" created by the high correlation between the variables. 

The need to keep these smaller oscillations under control limits the stepsize that can be 
used. As can be seen in the rightmost plot in Figure [3J there are also oscillations in the value 
of the Hamiltonian (which would be constant if the trajectory were simulated exactly). If a 
larger stepsize were used, these oscillations would be larger. At a critical stepsize (e = 0.45 
in this example), the trajectory becomes unstable, and the value of the Hamiltonian grows 
without bound. As long as the stepsize is less than this, however, the error in the Hamiltonian 
stays bounded regardless of the number of leapfrog steps done. This lack of growth in the 
error is not guaranteed for all Hamiltonians, but it does hold for many distributions more 
complex than Gaussians. As can be seen, however, the error in the Hamiltonian along the 
trajectory does tend to be positive more often than negative. In this example, the error is 
+0.41 at the end of the trajectory, so if this trajectory were used for an HMC proposal, the 
probability of accepting the end-point as the next state would be exp(— 0.41) = 0.66. 

Sampling from a two-dimensional distribution. Figures H] and |5] show the results 
of using HMC and a simple random-walk Metropolis method to sample from a bivariate 
Gaussian similar to the one just discussed, but with stronger correlation of 0.98. 

In this example, as in the previous one, HMC used a kinetic energy (defining the momen- 
tum distribution) of K(p) = p T p/2. The results of 20 HMC iterations, using trajectories of 
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Random-walk Metropolis Hamiltonian Monte Carlo 




-2-1012 -2-101 



Figure 4: Twenty iterations of the random-walk Metropolis method (with 20 updates per 
iteration) and of the Hamiltonian Monte Carlo method (with 20 leapfrog steps per trajectory) 
for a 2D Gaussian distribution with marginal standard deviations of one and correlation 0.98. 
Only the two position coordinates are plotted, with ellipses drawn one standard deviation 
away from the mean. 



Random-walk Metropolis 



Hamiltonian Monte Carlo 




03 

13 

: — 

o 
o 



o 

CL 



• ."V . . 



m — ~ www 

_ A • _ 



I 



I 



I 

50 



100 



150 



200 



Figure 5: Two hundred iterations, starting with the twenty iterations shown above, with 
only the first position coordinate plotted. 
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L = 20 leapfrog steps with stepsize e = 0.18 are shown in the right plot of Figure HI These 
values were chosen so that the trajectory length, eL, is sufficient to move to a distant point 
in the distribution, without being so large that the trajectory will often waste computation 
time by doubling back on itself. The rejection rate for these trajectories was 0.09. 

Figure H] also shows every 20th state from 400 iterations of random- walk Metropolis, with a 
bivariate Gaussian proposal distribution with the current state as mean, zero correlation, and 
the same standard deviation for the two coordinates. The standard deviation of the proposals 
for this example was 0.18, which is the same as the stepsize used for HMC proposals, so that 
the change in state in these random-walk proposals was comparable to that for a single 
leapfrog step for HMC. The rejection rate for these random- walk proposals was 0.37. 

One can see in Figure H]how the systematic motion during an HMC trajectory (illustrated 
in Figure [3]) produces larger changes in state than a corresponding number of random-walk 
Metropolis iterations. Figure |5] illustrates this difference for longer runs of 20 x 200 random- 
walk Metropolis iterations and of 200 HMC iterations. 

The benefit of avoiding random walks. Avoidance of random-walk behaviour, as il- 
lustrated above, is one major benefit of Hamiltonian Monte Carlo. In this example, because 
of the high correlation between the two position variables, keeping the acceptance proba- 
bility for random-walk Metropolis reasonably high requires that the changes proposed have 
a magnitude comparable to the standard deviation in the most constrained direction (0.14 
in this example, the square root of the smallest eigenvalue of the covariance matrix). The 
changes produced using one Gibbs sampling scan would be of similar magnitude. The num- 
ber of iterations needed to reach a state almost independent of the current state is mostly 
determined by how long it takes to explore the less constrained direction, which for this 
example has standard deviation 1.41 — about ten times greater than the standard deviation 
in the most constrained direction. We might therefore expect that we would need around 
ten iterations of random-walk Metropolis in which the proposal was accepted to move to 
a nearly independent state. But the number needed is actually roughly the square of this 
- around 100 iterations with accepted proposals — because the random-walk Metropolis 
proposals have no tendency to move consistently in the same direction. 

To see this, note that the variance of the position after n iterations of random walk 
Metropolis from some start state will grow in proportion to n (until this variance becomes 
comparable to the overall variance of the state), since the position is the sum of mostly 
independent movements for each iteration. The standard deviation of the amount moved 
(which gives the typical amount of movement) is therefore proportional to \fn. 

The stepsize used for the leapfrog steps is similarly limited by the most constrained 
direction, but the movement will be in the same direction for many steps. The distance 
moved after n steps will therefore tend to be proportional to n, until the distance moved 
becomes comparable to the overall width of the distribution. The advantage compared to 
movement by a random walk will be a factor roughly equal to the ratio of the standard 
deviations in the least confined direction and most confined direction — about 10 here. 

Because avoiding a random walk is so beneficial, the optimal standard deviation for 
random-walk Metropolis proposals in this example is actually much larger than the value of 
0.18 used here. A proposal standard deviation of 2.0 gives a very low acceptance rate (0.06), 
but this is more than compensated for by the large movement (to a nearly independent 
point) on the rare occasions when a proposal is accepted, producing a method that is about 
as efficient as HMC. However, this strategy of making large changes with a small acceptance 
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Figure 6: Values for the variable with largest standard deviation for the 100- dimensional 
example, from a random-walk Metropolis run and an HMC run with L = 150. To match 
computation time, 150 updates were counted as one iteration for random-walk Metropolis. 

rate works only when, as here, the distribution is tightly constrained in only one direction. 

Sampling from a 100-dimensional distribution. More typical behaviour of HMC and 
random-walk Metropolis is illustrated by a 100-dimensional multivariate Gaussian distribu- 
tion in which the variables are independent, with means of zero, and standard deviations of 
0.01, 0.02, . . . , 0.99, 1.00. Suppose we have no knowledge of the details of this distribution, 
so we will use HMC with the same simple, rotationally symmetric kinetic energy function as 
above, K(p) = p T p/2, and use random- walk Metropolis proposals in which changes to each 
vari able are independent, all with the same standard deviation. As discussed below in Sec- 
tion 14.11 the performance of both these sampling methods is invariant to rotation, so this ex- 
ample is illustrative of how they perform on any multivariate Gaussian distribution in which 
the square roots of the eigenvalues of the covariance matrix are 0.01, 0.02, . . . , 0.99, 1.00. 

For this problem, the position coordinates, gj, and corresponding momentum coordinates, 
Pi, are all independent, so the leapfrog steps used to simulate a trajectory operate indepen- 
dently for each (qi,Pi) pair. However, whether the trajectory is accepted depends on the 
total error in the Hamiltonian due to the leapfrog discretization, which is a sum of the errors 
due to each (qi,Pi) pair (for the terms in the Hamiltonian involving this pair). Keeping this 
error small requires limiting the leapfrog stepsize to a value roughly equal to the smallest 
of the standard deviations (0.01), which implies that many leapfrog steps will be needed to 
move a distance comparable to the largest of the standard deviations (1.00). 

Consistent with this, I applied HMC to this distribution using trajectories with L = 150 
and with e randomly selected for each iteration, uniformly from (0.0104,0.0156), which 
is 0.013 ± 20%. I used random-walk Metropolis with proposal standard deviation drawn 
uniformly from (0.0176,0.0264), which is 0.022 ± 20%. These are close to optimal settings 
for both methods. The rejection rate was 0.13 for HMC and 0.75 for random-walk Metropolis. 

Figure [6] shows results from runs of 1000 iterations of HMC (right) and of random-walk 
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Figure 7: Estimates of means (top) and standard deviations (bottom) for the 100-dimensional 
example, using random- walk Metropolis (left) and HMC (right). The 100 variables are 
labelled on the horizontal axes by the true standard deviaton of that variable. Estimates are 
on the vertical axes. 
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Metropolis (left), counting 150 random- walk Metropolis updates as one iteration, so that 
the computation time per iteration is comparable to that for HMC. The plot shows the last 
variable, with the largest standard deviation. The autocorrelation of these values is clearly 
much higher for random-walk Metropolis than for HMC. Figure shows the estimates for 
the mean and standard deviation of each of the 100 variables obtained using the HMC and 
random-walk Metropolis runs (estimates were just the sample means and sample standard 
deviations of the values from the 1000 iterations). Except for the first few variables (with 
smallest standard deviations), the error in the mean estimates from HMC is roughly 10 
times less than the error in the mean estimates from random-walk Metropolis. The standard 
deviation estimates from HMC are also better. 

The randomization of the l eapf rog stepsize done in this example follows the advice dis- 
cussed at the end of Section 13 .21 In this example, not randomizing the stepsize (fixing 
e = 0.013) does in fact cause problems — the variables with standard deviations near 0.31 
or 0.62 change only slowly, since 150 leapfrog steps with e = 0.013 produces nearly a full or 
half cycle for these variables, so an accepted trajectory does not make much of a change in 
the absolute value of the variable. 



4 HMC in practice and theory 

Obtaining the benefits from HMC illustrated in the previous section, including random- walk 
avoidance, requires proper tuning of L and e. I discuss tuning of HMC below, and also 
show how performance can be improved by using whatever knowledge is available regarding 
the scales of variables and their correlations. After briefly discussing what to do when 
HMC alone is not enough, I discuss an additional benefit of HMC — its better scaling with 
dimensionality than simple Metropolis methods. 



4.1 Effect of linear transformations 

Like all MCMC methods I'm aware of, the performance of HMC may change if the variables 
being sampled are transformed by multiplication by some non-singular matrix, A. However, 
performance stays the same (except perhaps in terms of computation time per iteration) if 
at the same time the corresponding momentum variables are multiplied by (A T )~ l . These 
facts provide insight into the operation of HMC, and can help us improve performance when 
we have some knowledge of the scales and correlations of the variables. 

Let the new variables be q' = Aq. The probability density for q' will be given by 
P'{q') = P(A~ 1 q') / \det(A)\, where P(q) is the density for q. If the distri but ion for q 
is the canonical distribution for a potential energy function U(q) (see Section IBTTj) . we can 
obtain the distribution for q' as the canonical distribution for U'(q') = U(A~ l q'). (Since 
|det(v4)| is a constant, we needn't include a log |det(A)| term in the potential energy.) 

We can choose whatever distribution we wish for the corresponding momentum variables, 
so we could decide to use the same kinetic energy as before. Alternatively, we can choose 
to transform the momentum variables by p' = (A T )~ 1 p, and use a new kinetic energy of 
K'(p') = K(A T p'). If we were using a quadratic kinetic energy, K(p) = p T M~ l p / 2 (see 
equation (12.6]) ). the new kinetic energy will be 



K'(p) = {A T p') T M-\A T p') /2 = {p'fiAM^A^p 1 /2 = (p') T (M , )~V / 2 (4.1) 
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where M' = (AM~ 1 A T )~ 1 = (A ) T M A . 

If we use momentum variables transformed in this way, the dynamics for the new variables, 
(q',p'), essentially replicates the original dynamics for (q,p), so the performance of HMC will 
be the same. To see this, note that if we follow Hamiltonian dyna mics for (q ',p') , the result 
in terms of the original variables will be as follows (see equations (12. 7p and (12.80 ): 

= A' 1 ^j- = A-^ikTrV = A' 1 (A M^ 1 A T )(A T )~ 1 p = M" 1 p (4.2) 

-j- = A T —y- = -A T VU'(q') = -A T (A- 1 ) T VU(A- 1 q') = -VU(q) (4.3) 
at at 

which matches what would happen following Hamiltonian dynamics for (q,p). 

If A is an orthogonal matrix (such as a rotation matrix), for which A^ 1 = A T , the 
performance of HMC is unchanged if we transform both q and p by multiplying by A (since 
( J 4 T )~ 1 — A). If we chose a rotationally symmetric distribution for the momentum, with 
M = ml (ie, the momentum variables are independent, each having variance m), such an 
orthogonal transformation will not change the kinetic energy function (and hence not change 
the distribution of the momentum variables), since we will have M' = (A (mI)~ 1 A T )~ 1 = ml. 

Such an invariance to rotation holds also for a random-walk Metropolis method in which 
the proposal distribution is rotationally symmetric (eg, Gaussian with covariance matrix 
ml). In contrast, Gibbs sampling is not rotationally invariant, nor is a scheme in which the 
Metropolis algorithm is used to update each variable in turn (with a proposal that changes 
only that variable). However, Gibbs sampling is invariant to rescaling of the variables (trans- 
formation by a diagonal matrix), which is not true for HMC or random- walk Metropolis, 
unless the kinetic energy or proposal distribution is transformed in a corresponding way. 

Suppose we have an estimate, E, of the covariance matrix for q, and suppose also that q 
has at least a roughly Gaussian distribution. How can we use this information to improve 
the performance of HMC? One way is to transform the variables so that their covariance 
matrix is close to the identity, by finding the Cholesky decomposition, E = LL T , with L 
being lower-triangular, and letting q' = L~ x q. We then let our kinetic energy function be 
K(p) = p T p 1 2. Since the momentum variables are independent, and the position variables 
are close to independent with variances close to one (if our estimate E, and assumption that 
q is close to Gaussian are good), HMC should perform well using trajectories with a small 
number of leapfrog steps, which will move all variables to a nearly independent point. More 
realistically, the estimate E may not be very good, but this transformation could still improve 
performance compared to using the same kinetic energy with the original q variables. 

An equivalent way to make use of the estimated covariance E is to keep the original q 
variables, but use the kinetic energy function K(p) = p T Hpj1 — ie, we let the momentum 
variables have covariance E _1 . The equivalence can be seen by transforming this kinetic 
energy to correspond to a transformation to q' = L~ 1 q (see equation ( 14. ip ). which gives 
K{p') = (p') T M'~ l p' with M' = (L" 1 (LL T )(L- 1 ) T )- 1 = /. 

Using such a kinetic energy function to compe nsate for correl ations between position 
variables has a long history in molecular dynamics ( iBennettl . 119751 ) . The usefulness of this 



technique is limited by the computational cost of matrix operations when the dimensionality 
is high. 
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Using a diagonal S can be feasible even in high- dimensional problems. Of course, this 
provides information only about the different scales of the variables, not their correlation. 
Moreover, when the actual correlations are non-zero, it is not clear what scales to use. Making 
an optimal choice is probably infeasible. Some approximation to the conditional standard 
deviation of each variabl e given all th e others may be possible — as I have done for Bayesian 
neural network models ( iNeall . Il996al ). If this also is not feasible, using approximations to 
the marginal standard deviations of the variables may be better than using the same scale 
for them all. 



4.2 Tuning HMC 

One practical impediment to the use of Hamiltonian Monte Carlo is the need to select 
suitable values for the leapfrog stepsize, e, and the number of leapfrog steps, L, which 
together determine the length of the trajectory in fictitious time, eL. Most MCMC methods 
have parameters that need to be tuned, with the notable exception of Gibbs sampling when 
the conditional distributions are amenable to direct sampling. However, tuning HMC is 
more difficult in some respects than tuning a simple Metropolis method. 

Preliminary runs and trace plots. Tuning HMC will usually require preliminary runs 
with trial values for e and L. In judging how well these runs work, trace plots of quantities 
that are thought to be indicative of overall convergence should be examined. For Bayesian 
inference problems, high-level hyperparameters are often among the slowest-moving quanti- 
ties. The value of the potential energy function, U(q), is also usually of central significance. 
The autocorrelation for such quantities indicates how well the Markov chain is exploring the 
state space. Ideally, we would like the state after one HMC iteration to be nearly independent 
of the previous state. 

Unfortunately, preliminary runs can be misleading, if they are not long enough to have 
reached equilibrium. It is possible that the best choices of e and L for reaching equilibrium 
are different from the best choices once equilibrium is reached, and even at equilibrium, it is 
possible that the best choices vary from one place to another. If necessary, at each iteration 
of HMC, e and L can be chosen randomly from a selection of values that are appropriate for 
different parts of the state space (or these selections and can be used sequentially). 

Doing several runs with different random starting states is advisable (for both preliminary 
and final runs), so that problems with isolated modes can be detected. Note that HMC is no 
less (or more) vulnerable to problems with isolated modes than other MCMC methods that 
make local changes to the state. If isolated modes are found to exist, something needs to be 
done to solve this problem — just combining runs that are each confined to a sin gle m ode 
is not valid. A modification of HMC with "tempering" along a trajectory (Section 15.71) can 
sometimes help with multiple modes. 

What stepsize? Selecting a suitable leapfrog stepsize, e, is crucial. Too large a stepsize 
will result in a very low acceptance rate for states proposed by simulating trajectories. Too 
small a stepsize will either waste computation time, by the same factor as the stepsize is too 
small, or (worse) will lead to slow exploration by a random walk, if the trajectory length, 
eL, is then too short (ie, L is not large enough, see below). 

Fortunately, as illustrated in Figure [3j the choice of stepsize is almost independent of 
how many leapfrog steps are done. The error in the value of the Hamiltonian (which will 
determine the rejection rate) usually does not increase with the number of leapfrog steps, 
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provided that the stepsize is small enough that the dynamics is stable. 

The issue of stability can be seen in a simple one- dimensional problem in which the 
following Hamiltonian is used: 

H(q,p) = q 2 /2a 2 + p 2 / 2 (4.4) 

The distribution for q that this defines is Gaussian with standard deviation a. A leapfrog step 
for this system (as for any quadratic Hamilto nian) will be a linear mapping from (q(t),p(t)) 
to (q(t + e),p(t + e)). Referring to equations (12.281) to (12.30!) . we see that this mapping can 
be represented by a matrix multiplication as follows: 



q(t + e) ' 




l-e 2 /2a 2 e 




' q(t) - 


p{t + e) _ 




-e/a 2 + e 3 /4a* l-e 2 /2a 2 







(4.5) 



Whether iterating this mapping leads to a stable trajectory, or one that diverges to infinity, 
depends on the magnitudes of the eigenvalues of the above matrix, which are 

(l-e 2 /2a 2 ) ± (e/a)y/e 2 /4a 2 - 1 (4.6) 

When e/a > 2, these eigenvalues are real, and at least one will have absolute value greater 
than one. Trajectories computed using the leapfrog method with this e will therefore be 
unstable. When e/a < 2, the eigenvalues are complex, and both have squared magnitude of 

(l-e 2 /2a 2 ) 2 + (e 2 /a 2 )(l-e 2 /Aa 2 ) = 1 (4.7) 
Trajectories computed with e < 2a are therefore stable. 

For multi-dimensional problems in which the kinetic energy used is K(p) = p T p/2 (as in 
the example above), the stability limit for e will be determined (roughly) by the width of 
the distribution in the most constrained direction — for a Gaussian distribution, this would 
the square root of the smallest eigenvalue of the covariance matrix for q. Stability for more 
general quadratic Hamiltonians with K(p) = p T M~ 1 p/2 can be determined by a pply ing a 
linear transformation that makes K(p') = (p') T p'/2, as discussed above in Section H~T1 

When a stepsize, e, that produces unstable trajectories is used, the value of H grows 
exponentially with L, and consequently the acceptance probability will be extremely small. 
For low- dimensional problems, using a value for e that is just a bit below the stability limit 
is sufficient to produce a good acceptance rate. For high-dimensional problems, however, 
the stepsize may need to be reduced further than this to keep the error in if to a level that 
produces a good acceptance probability. This is discussed further in Section 14.41 

Choosing too large a value of e can have very bad effects on the performance of HMC. 
In this respect, HMC is more sensitive to tuning than random- walk Metropolis. A standard 
deviation for proposals needs to be chosen for random-walk Metropolis, but performance 
degrades smoothly as this choice is made too large, without the sharp degradation seen 
with HMC when e exceeds the stability limit. (However, in high- dimensional problems, the 
degradation in random-walk Metropolis with too large a proposal standard deviation can 
also be quite sharp, so this distinction becomes less clear.) 

This sharp degradation in performance of HMC when the stepsize is too big would not 
be a serious issue if the stability limit were constant — the problem would be obvious from 
preliminary runs, and so could be fixed. The real danger is that the stability limit may differ 
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for several regions of the state space that all have substantial probability. If the preliminary 
runs are started in a region where the stability limit is large, a choice of £ a bit less than this 
limit might appear to be appropriate. However, if this e is above the stability limit for some 
other region, the runs may never visit this region, even though it has substantial probability, 
producing a drastically wrong result. To see why this could happen, note that if the run 
ever does visit the region where the chosen e would produce instability, it will stay there for 
a very long time, since the acceptance probability with that e will be very small. Since the 
method nevertheless leaves the correct distribution invariant, it follows that the run only 
rarely moves to this region from a region where the chosen e leads to stable trajectories. 
One simple context where this problem can arise is when sampling from a distribution with 
very light tails (lighter than a Gaussian distribution), for which the log of the density will 
fall faster than quadratically. In the tails, the gradient of the log den s ity wi ll be large, and a 



small stepsize will be needed for stability. See iRoberts and Tweedid (]1996[ ) for a discussion 



of this in the context of the Langevin method (see Section 15.21) . 

This problem can be alleviated by choosing e randomly from some distribution. Even if the 
mean of this dist ribution is too large, suitably small values for e may be chosen occasionally. 
(See Section l3~2l for another reason to randomly vary the stepsize.) The random choice of e 
should be done once at the start of a trajectory, not for every leapfrog step, since even if all 
the choices are below the stability limit, random changes at each step lead to a random- walk 
in the error for H, rather than the bounded error that is illustrated in Figure |3j 

The "short-cut" procedures described in Section 15.61 can be seen as ways of saving com- 
putation time when a randomly chosen stepsize in inappropriate. 



What trajectory length? Choosing a suitable trajectory length is crucial if HMC is to 
explore the state space systematically, rather than by a random walk. Many distributions are 
difficult to sample from because they are tightly constrained in some directions, but much 
less constrained in other directions. Exploring the less constrained directions is best done 
using trajectories that are long enough to reach a point that is far from the current point 
in that direction. Trajectories can be too long, however, as is illustrated in Figure [3j The 
trajectory shown on the left of that figure is a bit too long, since it reverses direction and 
then ends at a point that might have been reached with a trajectory about half its length. 
If the trajectory were a bit longer, the result could be even worse, since the trajectory would 
not only take longer to compute, but might also end near its starting point. 

For more complex problems, one cannot expect to select a suitable trajectory length by 
looking at plots like Figure EJ Finding the linear combination of variables that is least 
confined will be difficult, and will be impossible when, as is typical, the least confined 
"direction" is actually a non-linear curve or surface. 

Setting the trajectory length by trial and error therefore seems necessary. For a problem 
thought to be fairly difficult, a trajectory with L = 100 might be a suitable starting point. If 
preliminary runs (with a suitable e, see above) shows that HMC reaches a nearly independent 
point after only one iteration, a smaller value of L might be tried next. (Unless these 
"preliminary" runs are actually sufficient, in which case there is of course no need to do 
more runs.) If instead there is high autocorrelation in the run with L = 100, runs with 
L = 1000 might be tried next. 

As discussed at the ends of Sections 13.21 and I3.3[ randomly varying the length of the 
trajectory (over a fairly small interval) may be desirable, to avoid choosing a trajectory length 
that happens to produce a near-periodicity for some variable or combination of variables. 



26 



MCMC USING HAMILTONIAN DYNAMICS 



Using multiple stepsizes. Using the results in Section I4.1[ we can exploit information 
about the relative scales of variables to improve the performance of HMC. This can be done 
in two equivalent ways. If Si is a suitable scale for q iy we could transform q, by setting 
Qi = <li/ s i, or we could instead use a kinetic energy function of K(p) = p T M~ 1 p with M 
being a diagonal matrix with diagonal elements = 1/sf. 

A third equivalent way to exploit this information, which is often the most convenient, is 
to use different stepsizes for different pairs of position and mome nt um va riables. To see how 
this works, consider a leapfrog update (following equations 12.281 to 12.301) with rrij = 1/sf: 

Pl (t + e/2) = Pl (t) - (e/2)^-(q(t)) (4.8) 

ft(t + e) = q t (t) + es 2 iPi (t + e/2) (4.9) 

dU 

Pi (t + e) = Pi (t + e 2) - e /2 _( g (t + e (4.10) 

oq% 



Define (q^°\p^) to be the state at the beginning of the leapfrog step (ie, (q(t), P (t))), define 
(g^pW) to be the final state (ie, (q(t + e), P (t + e))), and define j/ 1 / 2 ) to be half-way 
momentum (ie, p(t + e/2)). We can now rewrite the leapfrog step above as 

(V2) = pf - (^)fV) (4.H) 



P 

^ = ^ + esU m (4-12) 
Pt ] = pf /2) - (c/2)^) (4.13) 



If we now define rescaled momentum variables, Pi = SiPi, and stepsizes = s»e, we can write 
the leapfrog update as 

pf /2) = Pf ] ~ ^2)^) (4.14) 

4 1] = 4 0) + e tP f /2) (4.15) 



pf ] = Pf ,2) - (^/2)^(g (1) ) (4.16) 



This is just like a leapfrog update with all = 1, but with different stepsizes for different 
(qiiPi) pairs. Of course, the successive values for (q,p) can no longer be interpreted as 
following Hamiltonian dynamics at consistent time points, but that is of no consequence for 
the use of these trajectories in HMC. Note that when we sample for the momentum before 
each trajectory, each pi is drawn independently from a Gaussian distribution with mean zero 
and variance one, regardless of the value of S{. 

This multiple stepsize approach is often more con venient, especially when the estimated 
scales, Si, are not fixe d, as discussed in Section I4T51 and the momentum is only partially 
refreshed (Section l5.3p . 
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4.3 Combining HMC with other MCMC updates 

For some problems, MCMC using Hamiltonian Monte Carlo alone will be impossible or 
undesirable. Two situations where non-HMC updates will be necessary are when some of 
the variables are discrete, and when the derivatives of the log probability density with respect 
to some of the variables are expensive or impossible to compute. HMC can then be feasibly 
applied only to the other variables. Another example is when special MCMC updates have 
been devised that may help convergence in ways that HMC does not - - eg, by moving 
between otherwise isola ted modes — but which are not a complete replacement for HMC. 
As discussed in Section 14.51 below, Bayesian hierarchical models may also be best handled 
with a combination of HMC and other methods such as Gibbs sampling. 

In such circumstances, one or more HMC updates for all or a subset of the variables can 
be alternated with one or more other updates that leave the desired joint distribution of 
all variables invariant. The HMC updates can be viewed as either leaving this same joint 
distribution invariant, or as leaving invariant the conditional distribution of the variables 
that HMC changes, given the current values of the variables that are fixed during the HMC 
update. These are equivalent views, since the joint density can be factored as this conditional 
density times the marginal density of the variables that are fixed, which is just a constant 
from the point of view of a single HMC update, and hence can be left out of the potential 
energy function. 

When both HMC and other updates are used, it may be best to use shorter trajectories 
for HMC than would be used if only HMC were being done. This allows the other updates 
to be done more often, which presumably helps sampling. Finding the optimal tradeoff is 
likely to be difficult, however. A v ariation on HMC that reduces the need for such a tradeoff 
is described below in Section 15.31 

4.4 Scaling with dimensionality 



In Section 13. 3[ one of the main benefits of HMC was illustrated — its ability to avoid the 
inefficient exploration of the state space via a random walk. This benefit is present (in at 
least some degree) for most practical problems. For problems in which the dimensionality 
is moderate to high, another benefit of HMC over simple random-walk Metropolis methods 
is a slower increase in the computation time needed (for a given level of accuracy) as the 
dimensionality increases. (Note that here I will consider only sampling performance after 
equilibrium is reached, not the time needed to approach equilibrium from some initial state 
not typical of the distribution, which is harder to analyse.) 

Creating distributions of increasing dimensionality by replication. To talk about 
how performance scales with dimensionality we need to assume something about how the 
distribution changes with dimensionality, d. 

I will assume that dimensionality increases by adding independent replicas of variables 
- ie, the potential energy function for q = (g 1; . . . , q d ) has the form U(q) = Swj(gj), for 
functions U{ drawn independently from some distribution. Of course, this is not what any 
real practical problem is like, but it may be a reasonable model of the effect of increasing 
dimensionality for some problems — for instance, in statistical physics, distant regions of 
large systems are often nearly independent. Not e that the independence assumption itself 
is not crucial, since as discussed in Section I4.1[ the performance of HMC (and of simple 
random-walk Metropolis) does not change if independence is removed by rotating the coor- 
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dinate system, provided the kinetic energy function (or random-walk proposal distribution) 
is rotationally symmetric. 

For distributions of this form, in which the variables are independent, Gibbs sampling 
will perform very well (assuming it is feasible), producing an independent point after each 
scan of all variables. Applying Metropolis updates to each variable separately will also work 
well, provided the time for a single-variable update does not grow with d. However, these 
methods are not invariant to rotation, so this good performance may not generalize to the 
more interesting distributions for which we hope to obtain insight with the analysis below. 

Scaling of HMC and random-walk Metropolis. Here, I discuss infor mally how well 



HMC and random-walk Metropolis scale with dimension, loosely following iCreutzl (11988 
Section III). 

To begin, Cruetz notes that the following relationship holds when any Metropolis-style 
algorithm is used to sample a density P(x) = (1/Z)exp(—E(x)): 

1 = E[P(x*)/P(x)] = E[exp(-(£(x*) - E(x)))] = E [exp(-A)] (4.17) 

where x is the current state, assumed to be distributed according to P(x), x* is the proposed 
state, and A = E(x*) — E(x). Jensen's inequality then implies that the expectation of the 
energy difference is non-negative: 

E [A] > (4.18) 

The inequality will usually be strict. 

When U(q) = Ewj(^j), and proposals are produced independently for each i, we can 
apply these relationships either to a single variable (or pair of variables) or to the entire 
state. For a single variable (or pair), I will write Ai for E(x*) — E(x), with x = qi and 
E(x) = Ui(qi), or x = (qi,Pi) and E(x) = Ui(qi) + pf/2. For the entire state, I will write A d 
for E(x*) — E(x), with x = q and E(x) = U(q), or x = (q,p) and E(x) = U(q) + K(p)). 
For both random-walk Metropolis and HMC, increasing dimension by replicating variables 
will lead to increasing energy differences, since A^ is the sum of Ai for each variable, each 
of which has positive mean. This will lead to a decrease in the acceptance probability - 
equal to min(l, exp(— A d )) — unless the width of the proposal distribution or the leapfrog 
stepsize is decreased to compensate. 

More specifically, for random-walk Metropolis with proposals that change each variable 
independently, the difference in potential energy between a proposed state and the current 
state will be the sum of independent differences for each variable. If we fix the standard 
deviation, for each proposed change, the mean and the variance of this potential energy 
difference will both grow linearly with d, which will lead to a progressively lower accep- 
tance rate. To maintain reasonable performance, ? will have to decrease as d increases. 
Furthermore, the number of iterations needed to reach a nearly independent point will be 
proportional to q~ 2 , since exploration is via a random walk. 

Similarly, when HMC is used to sample from a distribution in which the components of 
q are independent, using the kinetic energy K{p) = T.pf/2, the different (qi,Pi) pairs do 
not interact during the simulation of a trajectory — each (qi,Pi) pair follows Hamiltonian 
dynamics according to just the one term in the potential energy involving % and the one term 
in the kinetic energy involving p^. There is therefore no need for the length in fictitious time 
of a trajectory to increase with dimensionality. However, acceptance of the end-point of the 
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trajectory is based on the error in H due to the leapfrog approximation, which is the sum of 
the errors pertaining to each (qi,Pi) pair. For a fixed stepsize, e, and fixed trajectory length, 
eL, both the mean and the variance of the error in H grow linearly with d. This will lead 
to a progressively lower acceptance rate as dimensionality increases, if it is not counteracted 
by a decrease in e. The number of leapfrog steps needed to reach an independent point will 
be proportional to e^ 1 . 

To see which method scales better, we need to determine how rapidly we must reduce q 
and e as d increases, in order to maintain a reasonable acceptance rate. As d increases and q 
or e go to zero, A x will go to zero as well. Using a second-order approximation of exp(— A x ) 
as 1 — Ai + Af/2, together with equation (14. 17ft . we find that 

E[Ai] « E[Aj}/2 (4.19) 

It follows from this that the variance of A x is twice the mean of A x (when Ai is small), 
which implies that the variance of A^ is twice the mean of A^ (even when A^ is not small). 
To achieve a good acceptance rate, we must therefore keep the mean of A^ near one, since a 
large mean will not be saved by a similarly large standard deviation (which would produce 
fairly frequent acceptances as A^ occasionally takes on a negative value). 

For random-walk Metropolis with a symmetric proposal distribution, we can see how q 
needs to scale by directly averaging Ai for a proposal and its inverse. Let the proposal 
for one variable be x* = x + c, and suppose that c = a and c = —a are equally likely. 
Approximating U(x*) to second order as U(x) + cU'(x) + c 2 U"(x)/2, we find that the 
average of Ai = U(x*) — U(x) over c = a and c = —a is a 2 U"(x). Averaging this over 
the distribution of a, with standard deviation q, and over the distribution of x, we see 
that E [Ai] is proportional to q 2 . It follows that E [A d ] is proportional to dq 2 , so we can 
maintain a reasonable acceptance rate by letting q be proportional to d^ 1 / 2 . The number of 
iterations needed to reach a nearly independent point will be proportional to q~ 2 , which will 
be proportional to d. The amount of computation time needed will typically be proportional 
to d 2 . 

As discussed at the end of Section |273| the error in H when using the leapfrog discretization 
to simulate a trajectory of a fixed length is proportional to e 2 (for sufficiently small e). The 
error in H for a single (qi,Pi) pair is the same as Ai, so we see that A^ is proportional to 
e A . Equation 14.191 then implies that E [Ax] is also proportional to e A . The average total 
error in H for all variables, E [AJ, will be proportional to cfe 4 , and hence we must make e 
be proportional to d^ 1 ^ to maintain a reasonable acceptance rate. The number of leapfrog 
updates to reach a nearly independent point will therefore grow as d 1//4 , and the amount of 
computation time will typically grow as d 5 ' 4 , which is much better than the d 2 growth for 
random-walk Metropolis. 

Optimal acceptance rates. By extending the analysis above, we can determine what 
the acceptance rate of proposals is when the optimal choice of q or e is used. This is 
helpful when tuning the algorithms — provided, of course, that the distribution sampled 
is high-dimensional, and has properties that are adequately modeled by a distribution with 
replicated variables. 

To find this acceptance rate, we first note that since Metropolis methods satisfy detailed 
balance, the probability of an accepted proposal with A^ negative must be equal to the 
probability of an accepted proposal with A^ positive. Since all proposals with negative 
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Ad are accepted, the acceptance rate is simply twice the probability that a proposal has a 
negative A^. For large d, the Central Limit Theorem implies that the distribution of A^ is 
Gaussian, since it is a sum of d independent Ai values. (This assumes that the variance of 
each Ai is finite.) We saw above that the variance of A^ is twice its mean, E [A J = fi. The 
acceptance probability can therefore be written as follows ( iGupta et al.l . Il990f ). for large d: 



P(accept) = 2^((0 - fj.) /y/2fl) = 2^(-^/J^/2) = a(fi) (4.20) 

where <&(z) is the cumulative distribution function for a Gaussian variable with mean zero 
and variance one. 

For random-walk Metropolis, the cost to obtain an independent point will be proportional 
to 1/ (gk 2 ), where a is the acceptance rate. We saw above that fi = E [A^] is proportional to 
? 2 , so the cost follows the proportionality 

C IW cc l/(a(/i)/i) (4.21) 

Numerical calculation shows that this is minimized when /1 = 2.8 and a(fi) = 0.23. 

For HMC, the cost to obtain an independent point will be proportional to l/(ae), and as 
we saw above, fi is proportional to e 4 . From this we obtain 

C HMC oc 1 / (a(/V /4 ) (4-22) 
Numerical calculation shows that the minimum is when fi = 0.41 and a(fi) = 0.65. 

The same optimal 23% acceptance rate for random-walk Metropolis was previously ob- 
tained using a more formal analysis by iRoberts et al.l ( 19971 ). The optimal 65% acceptance 



rate for HMC that I derive ab ove is consi stent with previous empirical results on distribu- 
tions following the model here ( Neall. 119941. Figure 2), and on real high- dimensional problems 



Creutzl . [19881 . Figures 2 and 3; lSexton and Weingartenl . 119921 . Table 1). Kennedy and Pendleton 



19911 ) obtained explicit and rigorous results for HMC applied to multivariate Gaussian dis- 



tributions. 

Exploring the distribution of potential energy. The better scaling behaviour of HMC 
seen above depends crucially on the resampling of momentum variables. We can see this by 
considering how well the methods explore the distribution of the potential energy, U(q) = 
T,Ui(qi). Because U(q) is a sum of d independent terms, its standard deviation will grow in 
proportion to d 1 ^ 2 . 



Following iCaracciolo et aLl (Il994j ). we note that the expected change in potential energy 



from a single Metropolis update will be no more than order one — intuitively, large upwards 
changes are unlikely to be accepted, and since Metropolis updates satisfy detailed balance, 
large downward changes must also be rare (in equilibrium) . Because changes in U will follow 
a random walk (due again to detailed balance), it will take at least order (d 1 ^ 2 / l) 2 = d 
Metropolis updates to explore the distribution of U. 

In the first step of an HMC iteration, the resampling of momentum variables will typically 
change the kinetic energy by an amount that is proportional to d 1 ^ 2 , since the kinetic energy 
is also a sum of d independent terms, and hence has standard deviation that grows as d 1 ^ 2 
(more precisely, its standard deviation is <i 1//2 /2 1//2 ). If the second step of HMC proposes 
a distant point, this change in kinetic energy (and hence in H) will tend, by the end of 
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the trajectory, to have become equally split between kinetic and potential energy. If the 
end-point of this trajectory is accepted, the change in potential energy from a single HMC 
iteration will be proportional to d 1 ^ 2 , comparable to its overall range of variation. So, in 
contrast to random-walk Metropolis, we may hope that only a few HMC iterations will be 
sufficient to move to a nearly independent point, even for high-dimensional distributions. 

Analysing how well methods explore the distribution of U can also provide insight into 
their performance on distributions that aren't well modeled by replication of variables, as 
we will see in the next section. 



4.5 HMC for hierarchical models 

Many Bayesian models are defined hierarchically. A large set of low-level parameters have 
prior distributions that are determined by fewer higher-level "hyperparameters" , which in 
turn may have priors determined by yet-higher-level hyperparameters. For example, in a 
regression model with many predictor variables, the regression coefficients might be given 
Gaussian prior distributions, with mean of zero and a variance that is a hyperparameter. This 
hyperparameter could be given a broad prior distribution, so that its posterior distribution 
is determined mostly by the data. 

One could apply HMC to these models in an obvious way (after taking the logs of variance 
hyperparameters, so they will be unconstrained). However, it may be better to a pply HMC 
only to the lower-level parameters, for reasons I will now discuss. (See Section H~3l for general 
discussion of applying HMC to a subset of variables.) 



I will use my work on Bayesian neural network models (iNeall . 1996a) as an example. 



Such models typically have several groups of low-level parameters, each with an associated 
variance hyperparameter. The posterior distribution of these hyperparameters reflects im- 
portant aspects of the data, such as which predictor variables are most relevant to the task. 
The efficiency with which values for these hyperparameters are sampled from the posterior 
distribution can often determine the overall efficiency of the MCMC method. 

I use HMC only for the low-level parameters in Bayesian neural network models, with 
the hyperparameters being fixed during an HMC update. These HMC updates alternate 
with Gibbs sampling updates of the hyperparameters, which (in the simpler versions of the 
models) are independent given the low-level parameters, and have conditional distributions of 
standard form. By using HMC only for the low-level parameters, the leapfrog stepsizes used 
can be set using heuristics that are based on the current h yper parameter values. (I use the 
multiple stepsize approach described at the end of Section I4.2[ equivalent to using different 
mass values, m«, for different parameters.) For example, the size of the network "weights" 
on connections out of a "hidden unit" determine how sensitive the likelihood function is 
to changes in weights on connections into the hidden unit; the variance of the weights on 
these outgoing connections is therefore useful in setting the stepsize for the weights on the 
incoming connections. If the hyperparameters were changed by the same HMC updates as 
change the lower-level parameters, using them to set stepsizes would not be valid, since a 
reversed trajectory would use different stepsizes, and hence not retrace the original trajectory. 
Without a good way to set stepsizes, HMC for the low-level parameters would likely be much 
less efficient. 



Chool (120001 ) bypassed this problem by using a modification of HMC in which trajecto- 
ries are simulated by alternating leapfrog steps that update only the hyperparameters with 
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leapfrog steps that update only the low-level parameters. This procedure maintains both 
reversibility and volume-preservation (though not necessarily symplecticness), while allowing 
the stepsizes for the low-level parameters to be set using the current values of the hyper- 
parameters (and vice versa). However, performance did not improve as hoped because of a 
second issue with hierarchical models. 



In these Bayesian neural network models, and many other hierarchical models, the joint 
distribution of both low-level parameters and hyperparameters is highly skewed, with the 
probability density varying hugely from one region of high posterior probability to another. 
Unless the hyperparameters controlling the variances of low-level parameters have very nar- 
row posterior distributions, the joint posterior density for hyperparameters and low-level 
parameters will vary greatly from when the variance is low to when it is high. 

For instance, suppose that in its region of high posterior probability, a variance hyperpa- 
rameter varies by a factor of four. If this hyperparameter controls 1000 low-level parameters, 
their typical prior probability density will vary by a factor of 2 1000 = 1.07 x 10 301 , corre- 
sponding to a potential energy range of log(2 1000 ) = 693, with a standard deviation of 

693/12 1 / 2 = 200 (since the variance of a uniform distribution is one twelfth of its range). 
As discussed at the end of Section 14.41 one HMC iteration changes the energy only through 
the resampling of the momentum variables, which at best leads to a change in potential 
energy with standard deviation of about c/ 1//2 /2 3//2 . For this example, with 1000 low-level 
parameters, this is 11.2, so about (200/1 1.2) 2 = 319 HMC iterations will be needed to reach 
an independent point. 

One might obtain similar performance for this example using Gibbs sampling. However, 
for neural network models, there is no feasible way of using Gibbs sampling for the posterior 
distribution of the low-level parameters, but HMC can be applied to the conditional distri- 
bution of the low-level parameters given the hyperparameters. Gibbs sampling can then be 
used to update the hyperparameters. As we have seen, performance would not be improved 
by trying to update the hyperparameters with HMC as well, and updating them by Gibbs 
sampling is easier. 



iChool (2000) tried another approach that could potentially improve on this — reparam- 



eterizing low-level parameters 0j, all with variance exp(re), by letting 6i = 0j exp(fi;/2), and 
then sampling for k and the fa using HMC. The reparameterization eliminates the extreme 
variation in probability density that HMC cannot efficiently sample. However, he found 
that it is difficult to set a suitable stepsize for k, and that the error in H tended to grow 
with trajectory length, unlike the typical situation when HM C is used only for the low-level 
parameters. Use of "tempering" techniques (see Section !5.Tj) is another possibility. 

Even though it does not eliminate all difficulties, HMC is very useful for Bayesian neural 
network models — indeed, without it, they might not be feasible for most applications. Using 
HMC for at least the low-lev el parameter can produce similar benefits for other hierarchical 
models (eg, llshwaranl . |1999[ ). especially when the posterior correlations of these low-level 



parameters are high. As in any application of HMC, however, careful tuning of the stepsize 
and trajectory length is generally necessary. 
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5 Extensions and variations on HMC 

The basic HMC algorithm of Figure |2] can be modified in many ways, either to improve its 
efficiency, or to make it useful for a wider range of distributions. In this section, I will start by 
discussing alternatives to the leapfrog discretization of Hamilton's equations, and also show 
how HMC can handle distributions with constraints on the variables (eg, variables that must 
be positive) . I will then discuss a special case of HMC — when only one leapfrog step is done 
- and show how it can be extended to produce an alternative method of avoiding random 
walks, which may be useful when not all variables are updated by HMC. Most applications 
of HMC can benefit from using a variant in which "windows" of states are used to increase 
the acceptance probability. Another widely applicable technique is to use approximations to 
the Hamiltonian to compute trajectories, while still obtaining correct results by using the 
exact Hamiltonian when deciding whether to accept the endpoint of the trajectory. Tuning 
of HMC may be assisted by using a "short-cut" method that avoids computing the whole 
trajectory when the stepsize is inappropriate. Tempering methods have potential to help 
with distributions having multiple modes, or which are highly skewed. 

There are many other variations that I will not be able to review here, such as the use of a 
"shado w Hamiltonian" that is exactly c onserved by the inexact simulation of the real Hamil- 
tonian ( llzagguirre and Hamptonl . 120041 ) . and the use of symplectic integr ation methods more 



sophisticate d than the leapfrog m ethod (eg, iCreutz and Gockschl . 119891 ). including a recent 



proposal by lGirolami et ail (j2009f ) of a symplectic integrator for a non-separable Hamiltonian 



in which M in the kinetic energy of (I2.6P depends on q, allowing for "adaptation" based on 
local information. 



5.1 Discretization by splitting: handling constraints and other applications 

The leapfrog method is not the only discretization of Hamilton's equations that is reversible 
and volume-preserving, and hence can be used for Hamiltonian Monte Carlo. Many "sym- 
plectic integration methods" have been devised, mostly for applications other than HMC 
(eg, simulating the solar system for millions of years to test its stability). It is possible to 
de vise methods that have a high er order of accuracy than the leapfrog method (for example, 
see iMcLachlan and Ate la. 1992). Using such a method for HMC will produce asymptotically 



better performance than the leapfrog method, as dimensionality increases. Experience has 
shown, however, that the leapfrog method is hard to beat in practice. 

Nevertheless, it is worth taking a more general look at how Hamiltonian dynamics can 
be simulated, since this also points to how constraints on the variables can be handled, as 
well as possible improvements such as exploiting partial analytic solutions. 

Splitting the Hamiltonian. Many symplectic discretizations of Hamiltonian dynamics 
can be derived by "splitting" the Hamiltonian into several terms, and then for each term 
in succession, simulating the dynamics defined by that term for some small time step, then 
repeating this procedure until the desired total simulation time is reached. If the simulation 
for each term can be done analytically, we obtain a symplectic approximation to the dynamics 
that is feasible to implement. 



This general scheme is des cribed by iLeimkuhler and Reichl (I2004L Section 4.2) and by 



Sexton and Weingartenl (|1992[ ). Suppose that the Hamiltonian can be written as a sum of k 



terms, as follows: 

H(q,p) = Hi(q,p) + H 2 (q,p) + ••• + H k ^(q,p) + H k (q,p) (5.1) 
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Suppose also that we can exactly implement Hamiltonian dynamics based on each Hi, for 
i = 1, . . . , k, wit h Tj iF being the mapping defined by applying dynamics based on Hi for time 
e. As shown by iLeimkuhler and Reichl . if the Hi are twice differentiable, the composition 
of these mappings, Ti iE o T 2y£ o • • • o Tfc_i j£ o T fe e) is a valid discretization of Hamiltonian 
dynamics based on H, which will reproduce the exact dynamics in the limit as e goes to 
zero, with global error of order e or less. 

Furthermore, this discretization will preserve volume, and will be symplectic, since these 
properties are satisfied by each of the T itE mappings. The discretization will also be re- 
versible if the sequen ce o f Hi is symmetrical — ie, Hi(q,p) = H k _ i+1 (q,p). As mentioned 
at the end of Section 12.31 any reversible method must have global error of even order in e 
(ILeimkuhler and Reichl . I2004I . Section 4.3.3), which means the global error must be of order 
e 2 or better. 



We can derive the leapfrog method from a symmetrical splitting of the Hamiltonian. If 
H(q,p) = U(q) + K(p), we can write the Hamiltonian as 

H(q,p) = U(q)/2 + K(p) + U(q)/2 (5.2) 

which corresponds to a split with H\{q,p) = H^(q, p) = U(q )/2 and H 2 (q,p) = K{p). 
Hamiltonian dynamics based on Hi is (equations ( 12. II) and ( 12. 2 In : 

dq * dHl = (5.3) 



dt dpi 

djH _dHi 

dt dqi 2 dqi 



(5.4) 



Applying this dynamics for t ime e just adds — (e/2) dU /dqi to each pi, which is the first part 
of a leapfrog step (equation (12.281) ). The dynamics based on H 2 is as follows: 

dqi dH 2 OK 

(5.5) 



dt dpi dp 

dpi dH 2 



dt dqi 



(5.6) 



If K(p) = (1/2) YlPi/ m ii a PPly m g this dynamics for time e results in adding epi/rrii to each 
qi, which is the second part of a leapfro g step (equation (12.291) ). Finally, H 3 produces the 
third part of a leapfrog step (equation (12.301) ). which is the same as the first part, since 
H 3 = Hi. 

Splitting to exploit partial analytical solutions. One situation where splitting can 
help is when the potential energy contains a term that can, on its own, be handled analyti- 
cally. For example, the potential energy for a Bayesian posterior distribution will be the sum 
of minus the log prior density for the parameters and minus the log likelihood. If the prior 
is Gaussian, the log prior density term will be quadratic, and can be handled analytically 
(eg, see the one dimensional example at the end of Section |2~T1) . 



We can modify the leapfrog method for this situation by using a modified split. Suppose 
U(q) = U (q) + Ui(q), with U being analytically tractable, in conjunction with the kinetic 
energy, K{p). We use the split 

H(q,p) = Z7i(g)/2 + [U (q) + K(p)] + U x {q)/2 (5.7) 
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Ie, Hi(q,p) = H 3 (q,p) = Ui(q)/2 and H 2 (q,p) = U (q) + K(p). The first and last half-steps 
for p are the same as for ordinary leapfrog, based on U\ alone. The middle full step for 
q, which in ordinary leapfrog just adds ep to q, is replaced by the analytical solution for 
following the exact dynamics based on the Hamiltonian Uo(q) + K(p) for time e. 

With this procedure, it should be possible to use a larger stepsize (and hence use fewer 
steps in a trajectory), since part of the potential energy has been separated out and handled 
exactly. The benefit of handling the prior exactly may be limited, however, since the prior 
is usually dominated by the likelihood. 

Splitting potential energies with variable computational costs. Splitting can also 
help if the potential energy function can be split into two terms, one of whic h requires 
less computation time to evaluate than the other (jSexton and Weingartenl . I1992I ). Suppose 
U(q) = U (q) + Ui(q), with U being cheaper to compute than Ui, and let the kinetic energy 
be K{p)- We can use the following split, for some M > 1: 

M 

H(q,p) = + [ U oio)/ 2M + K(p)/M + U (q)/2M] + Z7i(g)/2 (5.8) 

m=l 

We label the k = 3M + 2 terms as H^q.p) = H k (q,p) = Ux{q)/2 and for i = 1, . . . , M, 
H 3i _i(q,p) = H 3i+ i(q,p) = Uo(q)/2M and H 3i (q,p) = K{p)/M. The resulting discretization 
can be seen as a nested leapfrog method. The M inner leapfrog steps involve only Uq, and 
use an effective stepsize of e/M. The outer leapfrog step takes half steps for p using only 
Ui, and replaces the update for q in the middle with the M inner leapfrog steps. 

If Uq is much cheaper to compute than Ui, we can use a large value for M without 
increasing computation time by much. The stepsize, e, that we can use will then be limited 
mostly by the properties of Ui, since the effective stepsize for Uq is much smaller, e/M. 
Using a bigger e than with the standard leapfrog method will usually be possible, and hence 
we will need fewer steps in a trajectory, with fewer computations of U±. 

Splitting according to data subsets. When sampling from the posterior distribution 
for a Bayesian model of independent data points, it may be possible to save computation 
time by splitting the potential energy into terms for subsets of the data. 

Suppose we partition the data into subsets S m , for i = 1,...,M, typically of roughly 
equal size. We can then write the log likelihood function as £(q) = ^m =1 d(<?), where £ m 
is the log likelihood function based on the data points in S m . If 7r(q) is the prior density for 
the parameters, we can let U m (q) = — log(7r(g))/M — £ m (q), and split the Hamiltonian as 
follows: 

M 

H(q,p) = [Um{q)/2 + K(p)/M + U m (q)/2\ (5.9) 

m=l 

Ie, we let the k = 3M terms be H 3m „ 2 (q,p) = H 3m (q,p) = U m (q)/2 and H 3m _i(q,p) = 
K(p)/m. The resulting discretization with stepsize e effectively performs M leapfrog steps 
with stepsize e/M, with the mth step using MU m as the potential energy function. 

This scheme can be beneficial if the data set is redundant, with many data points that 
are similar. We then expect MU m (q) to be approximately the same as U(q), and we might 
hope that we could set e to be M times larger than with the standard leapfrog method, 
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obtaining similar results with M times less computation. In practice, however, the error in 
H at the end of the traject ory will be l arger than with standard leapfrog, so the gain will 
be less than this. I found (iNeall . Il996al . Sections 3.5.1 and 3.5.2) that the method can be 



beneficial for neural network models, esp ecially when combined with the windowed HMC 
procedure described below in Section 15.41 

Note that unlike the other examples above, this split is not symmetrical, and hence the 
resulting discretization is not reversible. However, it can still be used to produce a proposal 
for HMC as long as the labelling of the subsets is randomized for each iteration, so that 
the reverse trajectory has the same probability of being produced as the forward trajectory. 
(It is possible, however, that some symmetrical variation on this split might produce better 
results.) 



Handling constraints. An argument based on splitting shows how to handle constraints 
on the variables being sampled. Here, I will consider only separate constraints on some 
subset of the variables, with the constraint on q$ taking the form qi < Ui, or qi > k, or both. 
A similar scheme can handle constraints taking the form G(q) > 0, for any differentiable 
function G. 



We can impose constraints on variables by letting the potential energy be infinite for 
values of q that violate any of the constraints, which will give such points probability zero. 
To see how to handle such infinite potential energies, we can look at a limit of potential 
energy functions that approach infinity, and the corresponding limit of the dynamics. 

To illustrate, suppose that U*(q) is the potential energy ignoring constraints, and that % 
is constrained to be less than «j. We can take the limit as r — > oo of the following potential 
energy function (which is one of many that could be used): 

{0 if qi < Ui 

r r+l( n _„,\r : fn ^ „ ( 5 - 10 ) 
' \Qi "'i) u qi Ui 

It is easy to see that lim^oo C r {q^ ui) is zero for any < Ui and infinity for any > Mj. 
For any finite r > 1, U(q) is differentiable, so we can use it to define Hamiltonian dynamics. 

To simulate the dynamics based on this U(q), with a kinetic energy K(p) = (1/2) Y2pl/ m ii 
we can use the split of equation (15. 7ft . with U\(q) = U*(q) and Uo(q) = C r (qi,Ui): 

H(q,p) = U.{q)/2 + [C r {q h ui) + K(p)] + U.(q)/2 (5.11) 

This produ ces a var iation on the leapfrog method in which the half- steps for p (equa- 
tions (12. 28ft and 12.29( 1 remain the same, but the full step for q (equation (12.291) ) is modified 
to account for the constraint on qi. After computing q[ = qi(t) + epiit + e /2)/m il we check 
if q[ > Ui. If not, the value of C r (qi, ui) must be zero all along the path from q^ to q[, and we 
can set q(t + e) to q[. But if q[ > u iy the dynamics based on the Hamiltonian C r (qi, Ui) + K{p) 
will be affected by the C r term. This term can be seen as a steep hill, which will be climbed 
as qi moves past Ui, until the point is reached where C r is equal to the previous value of 
(l/2)pi /rrii, at which point pi will be zero. (If r is sufficiently large, as it will be in the limit 
as r — > oo, this point will be reached before the end of the full step.) We will then fall down 
the hill, with pi taking on increasingly negative values, until we again reach q^ = Ui, when 
Pi will be just the negative of the original value of p%. We then continue, now moving in the 
opposite direction, away from the upper limit. 
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For each variable , i = 1, . . . , d : 

1) Let p' { = pi(t + e/2) 

2) Let q\ = q^t) + ep'Jmi 

3) If qi is constrained, repeat the following until q\ 
satisfies all constraints: 

a) If qi has an upper constraint , and q[ > Ui 

Let q[ = u t - (q[ - and p\ = -p\ 

b) If qi has a lower constraint , and q[ < li 

Let q[ = h + (h - and p\ = -p\ 

4) Let qi(t + e) = q\ and Pi (t + e/2) = p\ 

Figure 8: Modification to the leapfrog update of q (equation ( I2.29p ) to handle constraints of 
the form qi < Ui or qi < k- 

If several variables have constraints, we must follow this procedure for each, and if a 
variable has both upper and lower constraints, we must repeat the proced ure u ntil neither 
constraint is violated. The end result is that the full step for q of equation (I2.29|) is replaced 
by the procedure shown in Figure EJ Intuitively, the trajectory just bounces off the "walls" 
given by the constraints. If C/*(g) is constant, these bounces are the only interesting aspect 
of the dynam ics, and the procedure is sometimes referred to as "billiards" (see, for example, 
Ruianl . fl997T l. 



5.2 Taking one step at a time — the Langevin method 



A special case of Hamiltonian Monte Carlo arises when the trajectory used to propose a new 
state consists of only a single leapfrog step. Suppose that we use the kinetic energy K(p) = 
(V 2 ) Eft 2 - An iteration of HMC with one leapfrog step can be expressed in the following 
way. We sample values for the momentum variables, p, from their Gaussian distributions 
with mean zero and variance one, and then propose new values, q* and p*, as follows: 



Pi 



e 2 dU, , 

ft " ^dq-^ + m 

e dU . , e dU . . 



We accept q* as the new state with probability 

min [l, exp (- (U(q*) - U(q)) - ~ ^((pj) 

i 

and otherwise keep q as the new state. 



Pi, 



(5.12) 



(5.13) 



(5.14) 



Equation (I5.12p is known in physics as one type of "Langevin equation" , and this method 
is t herefore known a s Langevin Monte Carlo (LMC) in the the lattice field theory literature 
(eg. lKennedvllT99nl ). 
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One can also remove any explicit mention of momentum variables, and view this method 
as performing a Metropolis-Hastings update in which q* is proposed from a Gaussian distri- 
bution where the q* are independent, with means of g» — (e /2)[dU /dqi](q), and variances of 
e 2 . Since this proposal is not symmetrical, it must be accepted or rejected based on both the 
ratio of the probability densities of q* and q and on the ratio of the probability densities for 
proposing q from q* and vice versa (lHastingsl . 119701 ) . To see the equivalence with HMC using 
one leapfrog step, we can write the Metropolis-Hastings acceptance probability as follows: 



mm 



exp( 



17(9*)) A ex P ( " (* " It + ( £ V2) [dU/d qi }(q*)) 2 /2e 2 ) 



n 



exp(-*7(g)) 1 A exp ( _ ( ? * _ q . + ( e 2/ 2 ) [dU / dq^q)) 2 / 2e 2 ) 



(5.15) 



To see that this is the same as (I5.14p . note that using equations (I5.12p and ( I5.13p . we can 
write 

e 2 dU ] 



P 



P 



Qi - Qi + 



e^dU 
~2dq~i [ 



(5.16) 
(5.17) 



After substituting these into (15.14p . it is straightforward to see the equivalence to ( 15. 15ft . 

In this Metropolis-Hasting s form , the Langevin Monte Carlo method was first proposed by 
Rossky, Doll, and Friedman (1978), for use in physical simulations. Approximate La ngevin 
meth ods without an accept/reject step can also be used (for a discussion of this, see iNeall . 
199 3], Section 5.3) — a s , for in stance, in a paper on statistical inference for complex models by 
Grenander and Millerl ( 119901 ) . where also an accept/reject step is proposed in the discussion 
by J. Besag (p. 591). 

Although LMC can be seen as a special case of HMC, its properties are quite different. 
Since LMC updates are reversible, and generally make only small changes to the state (since 
e typically cannot be very large), LMC will explore the distribution via an inefficient random 
walk, just like random-walk Metropolis updates. 

However, LMC has better scaling behaviour than random-walk Metro polis as dimension- 
ality increases, as can be seen from an analysis paralleling that in Section I4T41 (ICreutzl . 11988 : 
iKennedyl . [1990). The local error of the leapfrog step is of order e 3 , so E [Af], the average 
squared error in H from one variable, will be of order e 6 . From equation (14.191) . E [A] will 
also be of order e 6 , and with d independent variables, E [A<j] will be of order de 6 , so that e 

must scale as d" 1 ^ 6 in order to maintain a reasonable acceptance rate. Since LMC explores 
the distribution via a random walk, the number of iterations needed to reach a nearly inde- 
pendent point will be proportional to e~ 2 , which grows as d 1 ^ 3 , and the computation time 
to reach a nearly independent point grows as c/ 4 / 3 . This is better than the d 2 growth in 
computation time for random walk Metropolis, but worse than the d 5 ^ growth when HMC 
is used with trajectories that are long enough to reach a nearly independent point. 



We can also find what the acceptance rate for LMC will be when the optimal e is used, 
when sampling a distribution with independent variables replicated d times. As for random 
walk Metr opolis and HMC, the acceptance rate is given in terms of \x = E [A^] by equa- 
tion (14.201) . The cost of obtaining a nearly independent point using LMC is proportional to 
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l/(a(/i)e 2 ), and since \i is proportional to e 6 , we can write the cost as 

C LMC ex l/(a(/i)// 3 ) (5.18) 

Numerical calculation shows that this is m inimized when a(/i) is 0.57 a result obtained more 
formally by lRoberts and Rosenthal (1998). This may be useful for tuning, if the behaviour of 
LMC for the distribution being sampled resembles its behaviour when sampling for replicated 
independent variables. 



5.3 Partial momentum refreshment — another way to avoid random walks 

The single leapfrog step used in the Langevin Monte Carlo algorithm will usually not be 
sufficient to move to a nearly independent point, so LMC will explore the distribution via 
an inefficient random walk. This is why HMC is typically used with trajectories of many 
leapfrog steps. An alternative that can suppress random walk behaviour even when trajec- 
tories consist of just one le apfrog step is to only partially refresh the momentum between 
trajectories, as proposed by lHorowitzl (119911 ). 



Suppose that the kinetic energy has the typical form K(p) = p T M~ 1 p/2. The following 
update for p will leave invariant the distribution for the momentum (Gaussian with mean 
zero and covariance M): 

p' = ap + (l-a 2 ) 1/2 n (5.19) 

Here, a is any constant in the interval [—1,-1-1], and n is a Gaussian random vector with 
mean zero and covariance matrix M. To see this, note that if p has the required Gaussian 
distribution, the distribution of p' will also be Gaussian (since it is a linear combination of 
independent Gaussians), will have mean 0, and will have covariance a 2 M + (1 — a 2 )M = M. 

If a is only slightly less than one, p' will be similar to p, but repeated updates of this 
sort will eventually produce a value for the momentum variables almost independent of 
the initial value. When a = 0, p' is just set to a random value drawn from its Gaussian 
distribution, independent of its previous value. Note that when M is diagonal, the update 
of each momentum variable, Pi, is independent of the updates of other momentum variables. 

The partial momentum update of equation (I5.19P can be substituted for the full replace- 
ment of the momentum in the standard HMC algorithm. This gives a generalized HMC 
algorithm in which an iteration consists of three steps: 

1) Update the momentum variables using equation (15.191) . Let the new momentum be p' . 

2) Propose a new state, (q*,p*), by applying L leapfrog steps with stepsize e, starting at 
(q,p'), and then negating the momentum. Accept {q*,p*) with probability 



mm 



1, exp{-U{q*) + U{q) - K(p*) + K{p')) (5.20) 



If (q*,p*) is accepted, let (q",p") = (q*,p*)', otherwise, let (q",p") = (q,p')- 
3) Negate the momentum, so that the new state is (q", —p"). 

The transitions in each of these steps — (q,p) — > {q,p'), (q,p') (q",p"), an d (q",p") — >■ 
(q", —p") — leave the canonical distribution for (q,p) invariant. The entire update therefore 
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also leaves the canonical distribution invariant. The three transitions also each satisfy de- 
tailed balance, but the sequential combination of the three does not satisfy detailed balance 
(except when a = 0). This is crucial, since if the combination were reversible, it would still 
result in random walk behaviour when L is small. 



Note that omitting step (3) above would result in a valid algorithm, but then, far from 
suppressing random walks, the method (with a close to one) would produce nearly back- 
and-forth motion, since the direction of motion would reverse with every trajectory accepted 
in step (2). With the reversal in step (3), motion continues in the same direction as long 
as the trajectories in step (2) are accepted, since the two negations of p will cancel. Motion 
reverses whenever a trajectory is rejected, so if random walk behaviour is to be suppressed, 
the rejection rate must be kept small. 

If a = 0, the above algorithm is the same as standard HMC, since step (1) will completely 
replace the momentum variables, step (2) is the same as for standard HMC, and step (3) 
will have no effect, since the momentum will be immediately replaced anyway, in step (1) of 
the next iteration. 



Since this algorithm can be seen as a generalization of standard HMC, with an additional 
a parameter, one might thi nk it will offer an improvemen t, provided that a is tuned for 
best performance. However, Kennedy and Pendleton! (120011 ) show that when the method is 
applied to high-dimensional multivariate Gaussian distributions only a small constant factor 
improvement is obtained, with no better scaling with dimensionality. Best performance is 
obtained using long trajectories (L large), and a value for a that is not very close to one 
(but not zero, so the optimum choice is not standard HMC). If L is small, the need to keep 
the rejection rate very low (by using a small e), as needed to suppress random walks, makes 
the method less advantageous than standard HMC. 



It is disappointing that only a small improvement is obtained with this generalization when 
sampling a multivariate Gaussian, due to limitations that likely apply to other distributions 
as well. However, the metho d m ay be m ore useful than one would think from this. For 
reasons discussed in Sections 14.31 and 14.51 we will often combine HMC updates with other 
MCMC updates (perhaps for variables not changed by HMC). There may then be a tradeoff 
between using long trajectories to make HMC more efficient, and using shorter trajectories so 
that the other MCMC updates can be done more often. If shorter-than-optimal trajectories 
are to be used for this reason, setting a greater than zero can reduce the random walk 
behaviour that would otherwise result. 



Furthermore, rejection rates can be reduced using the "window" method described in 
the next section. An analysis of partial momentum refreshment combined with the window 
method might find that using trajectories of moderate length in conjunction with a value for 
a greater than zero produces a more substantial improvement. 



5.4 Acceptance using windows of states 



Figure [3] (right plot) shows how the error in H varies along a typical trajectory computed 
with the leapfrog method. Rapid oscillations occur, here with a period of between 2 and 
3 leapfrog steps, due to errors in simulating the motion in the most confined direction (or 
directions, for higher-dimensional distributions). When a long trajectory is used to propose a 
state for HMC, it is essentially random whether the trajectory ends at a state where the error 
in H is negative or close to zero, and hence will be accepted with probability close to one, or 
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whether it happens to end at a state with a large positive error in if, and a correspondingly 
lower acceptance probability. If somehow we could smooth out these oscillations, we might 
obtain a high probability of acceptance for all trajectories. 

I introduced a method for achievi ng th i s resu lt that uses "windows" of states at the 
beginning and end of the trajectory (jNeall . [1994). Here, I will present the method as an 



application of a general technique in which we probabilistically map to a state in a different 
space, perform a Markov chain t ransition in this new space, and then probabilistically map 
back to our original state space (jNeall . 120061 ). 



Our original state space consists of pairs, (q,p), of position and momentum variables. We 
will map to a sequence of W pairs, [(qo,Po), ■ ■ ■ , (lw-i,Pw-i)}, m which each (qi,pi) for i>0 
is the result of applying one leapfrog step (with some fixed stepsize, e) to (gj_i,pj_i). Note 
that even though a point in the new space seems to consist of W times as many numbers 
as a point in the original space, the real dimensionality of the new space is the same as the 
old, since the whole sequence of W pairs is determined by ((?o>Po)- 

To probabilistically map from (q,p) to a sequence of pairs, [(qo,Po), ■ ■ ■ , (qw-i,Pw-i)]> 
we select s uniformly from {0, . . . , W — 1}, and set (q s ,p s ) m the new state to our current 
state (q,p). The other (qi,Pi) pairs in the new state are obtained using leapfrog steps from 
(q s ,p s ), f° r ^ > s, or backwards leapfrog steps (ie, done with stepsize — e) for i < s. It is 
easy to see, using the fact that leapfrog steps preserve volume, that if our original state is 
distributed with probability density P(q,p), then the probability density of obtaining the 
sequence [(qo,Po), ■ ■ ■ , (qw-i,Pw-i)} by this procedure is 

^ w-i 

P([(qo,Po),---,(vw-i,Pw-i)]) = ^^2 p (^Pi) ( 5 - 21 ) 

i=0 

since we can obtain this sequence from a (q,p) pair that matches any pair in the sequence, 
and the probability is 1/W that we will produce the sequence starting from each of these 
pairs (which happens only if the random selection of s puts the pair at the right place in the 
sequence) . 

Having mapped to a sequence of W pairs, we now perform a Metropolis update that keeps 
the sequence distribution defined by equation (I5.21j) invariant, before mapping back to the 
original state space. To obtain a Metropolis proposal, we perform L — W + 1 leapfrog steps 
(for some L > W — 1), starting from (g^-ijPw-i), producing pairs (qwiPw) to (9l,Pl)- 
We then propose the sequence [(q^, — px), ■ ■ ■ , (qL-w+i, ~Pl~w+i)]- We accept or reject this 
proposed sequence by the usual Metropolis criterion, with the acceptance probability being 



mm 



^2i=L-W+l P(liiPi 
TZo 1 P(Qi,Pi) 



(5.22) 



with P(q,p) oc exp(— H(q,p)). (Note here that H(q,p) = H(q, —p), and that starting from 
the proposed sequence would lead symmetrically to the original sequence being proposed.) 

This Metropolis update leaves us with either the sequence [(qi,PL), • • • , (qL-w+i, Pl-w+i)) 
- called the "accept window", — or the sequence [(qo,Po), ■ ■ ■ , (qw-i,Pw-i)) ~ called the 
"reject window". (Note that these windows will overlap if L+ 1 < 2W.) We label the pairs 
in the window chosen as [(qo,Po), • • • , {q^-uPw-i)]- We now produce a final state for the 
windowed HMC update by probabilistically mapping from this sequence to a single pair, 
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choosing (qt,pt) with probability 



(5.23) 



If the sequence in the chosen window was distributed according to equation (I5.21j) . the pair 
{O-tiPt) chosen will be distributed according to P(q,p) oc exp(— H(q,p)), as desired. To 
see this, let (qt+mPt+n) be the result of applying n leapfrog steps (backward ones if n < 0) 
starting at (g+,p+). The probability density that {qtiPt) will result from mapping from a 
sequence to a single pair can then be written as follows, considering all sequences that can 
contain (qt ,vt) an d their probabilities: 



k= e -W+l 



F((lt -' >t] = Pfapt) (5.24) 



The entire procedure therefore leaves the correct distribution invariant. 

When W > 1, the potential problem with ergodicity discussed at the end of Section I3~2l 
does not arise, since there is a non-zero probability of moving to a state only one leapfrog 
step away, where q may differ arbitrarily from its value at the current state. 

It might appear that the windowed HMC procedure requires saving all 2W states in the 
accept and reject windows, since any one of these states might become the new state when a 
state is selected from either the accept window or reject window. Actually, however, at most 
three states need to be saved — the start state, so that forward simulation can be resumed 
after the initial backward simulation, plus one state from the reject window and one state 
from the accept window, one of which will become the new state after one of these windows 
is chosen. As states in each window are produced in sequence, a decision is made whether 
the state just produced should replace the state presently saved for that window. Suppose 
the sum of the probability densities of states seen so far is s« = p\ H — • +p%. If the state just 
produced has probability density Pi + i, it replaces the previous state saved from this window 
with probability p i+ i/{si + p i+1 ). 



I showed (iNeall . Il994j ) that, compared to standard HMC, using windows improves the 



performance of HMC by a factor of two or more, on multivariate Gaussian distributions 
in which the standard deviation in some directions is much larger than in other directions. 
This is because the acceptance probability in equation (15.221) uses an average of probability 
densities over states in a window, smoothing out the oscillations in H from inexact simulation 
of the trajectory. Empirically, the advantage of the windowed method was found to increase 
with dimensionality. For high- dimensional distributions, the acceptance probability when 
using the optimal stepsize was approximately 0.85, larger than the theoretical value of 0.65 
for HMC (see Section KM- 

These results for multivariate Gaussian distributions were obtained with a window size, 
W, much less than the trajectory length, L. For less regular distributions, it may be ad- 
vantageous to use a much larger window. When W = L/2, the acceptance test determines 
whether the new state is from the first half of the trajectory (which includes the current 
state) or the second half; the new state is then chosen from one half or the other with prob- 
abilities proportional to the probability densities of the states in that half. This choice of W 
guards against the last few states of the trajectory having low probability density (high H), 
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as might happen if the trajectory had by then entered a region where the stepsize used was 
too big. 

The windowed var iant of HMC may make other variants of HMC more attractive. One 
such variant (Section 15. ip splits the Hamiltonian into many terms corresponding to subsets 
of the data, which tends to make errors in H higher (while saving com puta tion). Errors in H 
have less effect when averaged over windows. As discussed in Section [573| very low rejection 
rates are desirable when using partial momentum refreshment. It is easier to obtain a low 
rejection probability using windows (ie, a less drastic reduction in e is needed), which makes 
partial momentum refreshment more attractive. 



iQin and Liul ( 120011 ) introduced a variant on windowed HMC. In their version, L leapfrog 



steps are done from the start state, with the accept window consisting of the states after the 
last W of these steps. A state from the accept window is then selected with probabilities 
proportional to their probability densities. If the state selected is k states before the end, k 
backwards leapfrog steps are done from the start state, and the states found by these steps 
along with those up to W — k — 1 steps forward of the start state form the reject window. 
The state selected from the accep t win dow then becomes the next state with probability 
given by the analogue of equation (15. 22ft ; otherwise the state remains the same. 

Qin and Liu's procedure is quite similar to the original windowed HMC procedure. One 
disadvantage of Qin and Liu's procedure is that the state is unchanged when the accept 
window is rejected, whereas in the original procedure a state is selected from the reject 
window (which might be the current state, but often will not be). The only other difference 
is that the number of steps from the current state to an accepted state ranges from L — W + l 
to L (average L — (W + l)/2) with Qin and Liu's procedure, versus from L — 2W + 2 to L 
(average L — W + l) for the original windowed HMC procedure, while the number of leapfrog 
steps computed varies from L to L + W — 1 with Qin and Liu's procedure, and is fixed at L 
with the original procedure. These differences are slight if W <C L. Qin and Lin claim that 
their procedure performs better than the original on high-dimensional multivariate Gaussian 
distributions, but their experiments are flawed^ 



IQin and Liul (|2001[ ) also introduce the more useful idea of weighting the states in the accept 



and reject windows non-uniformly, which can be incorporated into the original procedure as 
well. When mapping from the current state to a sequence of W weighted states, the position 
of the current state is chosen with probabilities equal to the weights, and when computing the 
acceptance probability or choosing a state from the accept or reject window, the probability 
densities of states are multiplied by their weights. Qin and Liu use weights that favour states 
more distant from the current state, which could be useful by usually causing movement 
to a distant point, while allowing choice of a nearer point if the distant points have low 
probability density. Alternatively, if one sees a window as a way of smoothing the errors in 
H, symmetrical weights that implement a better "low pass filter" would make sense. 



5.5 Using approximations to compute the trajectory 

The validity of Hamiltonian Monte Carlo does not depend on using the correct Hamiltonian 
when simulating the trajectory. We can instead use some approximate Hamiltonian, as long 

4 In their first comparison, their method computes an average of 55 leapfrog steps per iteration, but the original only 
computes 50 steps, a difference in computation time which if properly accounted for negates the slight advantage they see for 
their procedure. Their second comparison has a similar problem, and it is also clear from an examination of the results (in 
their Table I) that the sampling errors in their comparison are too large for any meaningful conclusions to be drawn. 
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as we simulate the dynamics based on it by a method that is reversible and volume preserving. 
However, the exact Hamiltonian must be used when computing the probability of accepting 
the end-point of the trajectory. There is no n eed to look for an approximation to the kinetic 
energy, when it is of a simple form such as (12.231) . but the potential energy is often much 
more complex and costly to compute — for instance, it may involve the sum of log likelihoods 
based on many data points, if the data cannot be summarized by a simple sufficient statistic. 
When using trajectories of many leapfrog steps, we can therefore save much computation 
time if a fast and accurate approximation to the potential energy is available, while still 
obtaining exact results (apart from the usual sampling variation inherent in MCMC). 

Many ways of approximating the potential energy might be useful. For example, if its 
evaluation requires iterative numerical methods, fewer iterations might be done than are 
necessary to get a result accurate to machine precision. In a Bayesian statistical application, a 
less costly approximation to the unnormalized posterior density (whose log gives the potential 
energy) may be obtainable by simply looking at only a subset of the data. This ma y not 
be a good strategy in general, but I h ave found it useful for Gaussian process models (jNeall . 
19981 ; iRasmussen and Williams! . 120061 ) . for which computation time scales as the cube of the 
number of data points, so that even a small reduction in the number of points produces a 
useful speedup. 

Rasmussenl (120031 ) has proposed approximating the potential energy by modeling it as a 
Gaussian process, that is inferred from values of the potential energy at positions selected 
during an initial exploratory phase. This method assumes only a degree of smoothness of the 
potential energy function, and so could be widely applied. It is limited, however, by the cost 
of Gaussian process inference, and so is most useful for problems of moderate dimensionality 
for which exact evaluation of the potential energy is very costly. 

An interesting possibility, to my knowledge not yet explored, would be to express the 
exact potential energy as the sum of an approximate potential energy and the error in th is 
approximation, and to then apply one of the splitting techniques described in Section [57X1 — 
exploiting either the approximation's analytic tractability (eg, for a Gaussian approximation, 
with quadratic potential energy), or its low computational cost, so that its dynamics can be 
accurately simulated at little cost using many small steps. This would reduce the number 
of evaluations of the gradient of the exact potential energy if the variation in the potential 
energy removed by the approximation term permits a large stepsize for the error term. 



5.6 Short-cut trajectories — adapting the stepsize without adaptation 

One significant disadvantage of Hamiltonian Monte Carlo is that, as discussed in Section |4T2| 
its performance depends critically on the settings of its tuning parameters — which consist 
of at least the leapfrog stepsize, e, and number of leapfrog steps, L, with variations such as 
windowed HMC having additional tuning parameters as well. The optimal choice of trajec- 
tory length (eL) depends on the global extent of the distribution, so finding a good trajectory 
length likely requires examining a substantial number of HMC updates. In contrast, just a 
few leapfrog steps can reveal whether some choice of stepsize is good or bad, which leads to 
the possibility of trying to set the stepsize "adaptively" during an HMC run. 



Recent work on adaptive MCMC methods is reviewed by lAndrieu and Thorns! ([2008). As 



they explain, naively choosing a stepsize for each HMC update based on results of previous 
updates — eg, reducing the stepsize by 20% if the previous 10 trajectories were all rejected, 
and increasing it by 20% if less than two of the 10 previous trajectories were rejected - 



5. Extensions and variations on HMC 



45 



undermines proofs of correctness (in particular, the process is no longer a Markov chain), 
and will in general produce points from the wrong distribution. However, correct results can 
be obtained if the degree of adaptation declines over time. Adaptive methods of this sort 
could be used for HMC, in much the same way as for any other tunable MCMC method. 



An alternative approach (jNeall . 120051 . 120071 ) is to perform MCMC updates with various 



values of the tuning parameters, set according to a schedule that is predetermined or cho- 
sen randomly without reference to the realized states, so that the usual proofs of MCMC 
convergence and error analysis apply, but to do this using MCMC updates that have been 
tweaked so that they require little computation time when the tuning parameters are not 
appropriate for the distribution. Most of the computation time will then be devoted to up- 
dates with appropriate values for the tuning parameters. Effectively, the tuning parameters 
are set adaptively from a computational point of view, but not from a mathematical point 
of view. 



For example, trajectories that are simulated with a stepsize that is much too large can be 
rejected after only a few leapfrog steps, by rejecting whenever the change (either way) in the 
Hamiltonian due to a single step (or a short series of steps) is greater than some threshold 
- ie, we reject if \H(q(t + e), p(t + e)) — H (q(t) , p(t))\ is greater than the threshold. If 
this happens early in the trajectory, little computation time will have been wasted on this 
unsuitable stepsize. Such early termination of trajectories is valid, since any MCMC update 
that satisfies detailed balance will still satisfy detailed balance if it is modified to eliminate 
transitions either way between certain pairs of states. 

With this simple modification, we can randomly choose stepsizes from some distribution 
without wasting much time on those stepsizes that turn out to be much too large. However, 
if we set the threshold small enough to reject when the stepsize is only a bit too large, 
we may terminate trajectories that would have been accepted, perhaps after a substantial 
amount of computation has already been done. Trying to terminate trajectories early when 
the stepsize is smaller than optimal carries a similar risk. 

A less drastic alternative to terminating trajectories when the stepsize seems inappropriate 
is to instead reverse the trajectory. Suppose we perform leapfrog steps in groups of k 
steps. Based on the changes in H over these k steps, we can test whether the stepsize is 
inappropriate — eg, the group may fail the test if the standard deviation of H over the 
k + 1 states is greater than some upper threshold or less than some lower threshold (any 
criterion that would yield the same decision for the reversed sequence is valid). When a 
group of k leapfrog steps fails this test, the trajectory stays at the state where this group 
started, rather than moving k steps forward, and the momentum variables are negated. The 
trajectory will now exactly retrace states previously computed (and which therefore needn't 
be recomputed), until the initial state is reached, at which point new states will again be 
computed. If another group of k steps fails the test, the trajectory will again reverse, after 
which the whole remainder of the trajectory will traverse states already computed, allowing 
its endpoint to be found immediately without further computation. 

This scheme behaves the same as standard HMC if no group of k leapfrog steps fails the 
test. If there are two failures early in the trajectory, little computation time will have been 
wasted on this (most likely) inappropriate stepsize. Between these extremes, it is possible 
that one or two reversals will occur, but not early in the trajectory; the endpoint of the 
trajectory will then usually not be close to the initial state, so the non-negligible computation 
performed will not be wasted (as it would be if the trajectory had been terminated). 
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Such short-cut schemes can be effective at finding good values for a small number of 
tuning parameters, for which good values will be picked reasonably often when drawing 
them randomly. It will not be feasible for setting a la rge number of tuning parameters, 
such as the entries in the "mass matrix" of equation ( 12. 6 I) when dimensionality is high, since 
even if two reversals happen early on, the cost of using inappropriate values of the tuning 
parameters will dominate when appropriate values are chosen only very rarely. 



5.7 Tempering during a trajectory 

Standard Hamiltonian Monte Carlo and the variations described so far have as much dif- 
ficulty moving between modes that are separated by regions of low probability as other 
local MCMC methods, such as random-walk Metropolis or Gibbs sampling. Several general 
schemes have been devised for solving problems with such isolated modes that involve sam- 
pling from a series of distributions that ar e more diffus e than the distribution of interest. 
Such sche mes include parallel temper ing (iGeyerl . 119911 : lEarl and Deeml. 2005). simulated 



tempering (jMarinari and Parisil. IT9 92). tempered transitions ( Neall . 1996 rj[ ) . and annealed 



X O \) ^-j^ 1 T I I ■/ 7 X \" '7 ■ ■/ 7 

importance sampling (INeall . 120011 ). Most commonly, t hese distributions are obtained by 



varying a "temperature" parameter, T, as in equation (13. ip . with T = 1 corresponding to 
the distribution of interest, and larger values of T giving more diffuse distributions. Any of 
these "tempering" methods could be used in conjunction with HMC. However, tempering- 
like behaviour can also be incorporated directly into trajectory used to propose a new state 
in the HMC procedure. 



In the simplest version of such a "tempered trajectory" scheme ( INeall . Il996bl Section 6) 



each leapfrog step in the first half of the trajectory is combined with multiplication of the 
momentum variables by some factor a slightly greater than one, and each leapfrog step in 
the second half of the trajectory is combined with division of the momentum by the same 
factor a. These multiplications and divisions can be done in various ways, as long as the 
result is reversible, and the divisions are paired exactly with multiplications. The most 
symmetrical scheme is to multiply the momentum by \fa before the first half-step for m o- 
mentum (equation ( 12.281) ) and after the second half-step for momentum (equation ( 12.301) ). 
for leapfrog steps in the first half of the trajectory, and correspondingly, to divide the mo- 
mentum by yfoL before the first and after the second half-steps for momentum in the second 
half of the trajectory. (If the trajectory has an odd number of leapfrog steps, for the middle 
leapfrog step of the trajectory, the momentum is multiplied by \/a before the first half-step 
for momentum, and divided by y/a after the second half-step for momentum.) Note that 
most of the multiplications and divisions by y/a are preceded or followed by another such, 
and so can be combined into a single multiplication or division by a. 

It is easy to see that the determinant of the Jacobian matrix for such a tempered trajectory 
is one, just as for standard HMC, so its endpoint can be used as a proposal without any 
need to include a Jacobian factor in the acceptance probability. 

Multiplying the momentum by an a that is slightly greater than one increases the value 
of H(q,p) slightly. If H initially had a value typical of the canonical distribution at T = 1, 
after this multiplication, H will be typical of a value of T that is slightly higher^ Initially, 
the change in H (q, p) = K(p) + U (q) is due entirely to a change in K{p) as p is made bigger, 



5 This assumes that the typical value of if is a continuous function of T, which may not be true for systems that have a 
"phase transition" . Where there is a discontinuity (in practice, a near discontinuity) in the expected value of H as a function 
of T, making small changes to H , as here, may be better than making small changes to T (which may imply big changes in the 
distribution). 
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but subsequent dynamical steps will tend to distribute the increase in H between K and 
U, producing a more diffuse distribution for q than is seen when T — 1. After many such 
multiplications of p by a, values for q can be visited that are very unlikely in the distribution 
at T = 1, allowing movement between modes that are separated by low-probability regions. 
The divisions by a in the second half of the trajectory result in H returning to values that 
are typical for T = 1, but perhaps now in a different mode. 

If a is too large, the probability of accepting the endpoint of a tempered trajectory will 
be small, since H at the endpoint will likely be much larger than H at the initial state. To 
see this, consider a trajectory consisting of only one leapfrog step. If e = 0, so that this 
step does nothing, the multiplication by \fa before the first half step for momentum would 
be exactly cancelled by the division by \fa after the second half step for momentum, so H 
would be unchanged, and the trajectory would be accepted. Since we want something to 
happen, however, we will use a non-zero e, which will on average result in the kinetic energy 
decreasing when the leapfrog step is done, as the increase in H from the multiplication by 
yfa. is redistributed from K alone to both K and U. The division of p by y/a will now not 
cancel the multiplication by \fa — instead, on average, it will reduce H by less than the 
earlier increase. This tendency for H to be larger at the endpoint than at the initial state can 
be lessened by increasing the number of leapfrog steps, say by a factor of R, while reducing 
a to ct 1//i? , which (roughly) maintains the effective temperature reached at the midpoint of 
the trajectory. 

Figure M illustrates tempered trajectories used to sample from an equal mixture of two 
bivariate Gaussian distributions, with means of [0 0] and [10 10], and covariances of / and 
21. Each trajectory consists of 200 leapfrog steps, done with e = 0.3, with tempering done 
as described above with a = 1.04. The left plots show how H varies along the trajectories; 
the right plots show the position coordinates for the trajectories. The top plots are for a 
trajectory starting at q — [—0.4 — 0.9] and p = [0.7 — 0.9], which has an endpoint in the 
other mode around [10 10]. The bottom plots are for a trajectory starting at q — [0.1 1.0] 
and p = [0.5 0.8], which ends in the same mode it begins in. The change in H for the top 
trajectory is 0.69, so it would be accepted with probability exp(— 0.69) = 0.50. The change 
in H for the bottom trajectory is —0.15, so it would be accepted with probability one. 

By using such tempered trajectories, HMC is able to sample these two well-separated 
modes - - 11% of the trajectories move to the other mode and are accepted — whereas 
standard HMC does very poorly, being trapped for a very long time in one of the modes. 
The parameters for the tempered trajectories in Figure [9] where chosen to produce easily 
interpreted pictures, and are not optimal. More efficient sampling is obtained with a much 
smaller number of leapfrog steps, larger stepsize, and larger a — eg, L = 20, e = 0.6, and 
a = 1.5 gives a 6% probability of moving between modes. 

A fundamental limitation of the tempering method described above is that (as for standard 
HMC) the endpoint of the tempered trajectory is unlikely to be accepted if the value for H 
there is much higher that for the initial state. This corresponds to the probability density at 
the endpoint being much lower than at the current state. Consequently, the method will not 
move well between two modes with equal total probability if one mode is high and narrow 
and the other low and broad, especially when the dimensionality is high. (Since acceptance 
is based on the joint density for q and p, there is some slack for moving to a point where the 
density for q alone is different, but not enough to eliminate this problem.) I have proposed 
( jNeall . [1999) a modification that addresses this, in which the point moved to can come from 



anywhere along the tempered trajectory, not just the endpoint. Such a point must be selected 
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Figure 9: Illustration of tempered trajectories on a mixture of two Gaussians. The trajectory- 
shown in the top plots moves between modes; the one shown in the bottom plots ends in the 
same mode. 



based both on its value for H and the accumulated Jacobian factor for that point, which is 
easily calculated, since the determinant of the Jacobian matrix for a multiplication of p by 
a is simply a d , where d is the dimensionality. This modified tempering procedure can not 
only move between modes of differing width, but also move back and forth between the tails 
and the central area of a heavy-tailed distribution. 

More details on these variations on Hamiltonian Monte Carlo can be found in the R 
implementations available from my web page, at www.cs.utoronto.ca/~radford. 
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